Introduction

Malaria  is  caused  by  apicomplexan  parasites  of  the  genus  Plasmodium.  It  is  a  major 
public  health  problem  in  many  tropical  areas  of  the  world,  and  also  affects  many  individuals  and 
military  forces  that  visit  these  areas.  In  1994  the  World  Health  Organization  estimated  that  there 
were  300-500  million  cases  and  up  to  2.7  million  deaths  caused  by  malaria  each  year,  and 
because  of  increased  parasite  resistance  to  chloroquine  and  other  antimalarials  the  situation  is 
expected  to  worsen  considerably  [1].  These  dire  facts  have  stimulated  efforts  to  develop  an 
international,  coordinated  strategy  for  malaria  research  and  control  [2].  Development  of  new 
drugs  and  vaccines  against  malaria  will  undoubtedly  be  an  important  factor  in  control  of  the 
disease.  However,  despite  recent  progress,  drug  and  vaccine  development  has  been  a  slow  and 
difficult  process,  hampered  by  the  complex  life  cycle  of  the  parasite,  a  limited  number  of  drug 
and  vaccine  targets,  and  our  incomplete  understanding  of  parasite  biology  and  host-parasite 
interactions. 

The  advent  of  microbial  genomics,  i.e.  the  ability  to  sequence  and  study  the  entire 
genomes  of  microbes,  should  accelerate  the  process  of  drug  and  vaccine  development  for 
microbial  pathogens.  As  pointed  out  by  Bloom,  the  complete  genome  sequence  provides  the 
“sequence  of  every  virulence  determinant,  every  protein  antigen,  and  every  drug  target”  in  an 
organism  [3],  and  establishes  an  excellent  starting  point  for  this  process.  In  1995,  an  international 
consortium  including  the  National  Institutes  of  Health,  the  Wellcome  Trust,  the  Burroughs 
Wellcome  Fund,  and  the  US  Department  of  Defense  was  formed  (Malaria  Genome  Sequencing 
Project)  to  finance  and  coordinate  genome  sequencing  of  the  human  malaria  parasite 
Plasmodium  falciparum,  and  later,  a  second,  yet  to  be  determined,  species  of  Plasmodium. 
Another  major  goal  of  the  consortium  was  to  foster  close  collaboration  between  members  of  the 
consortium  and  other  agencies  such  as  the  World  Health  Organization,  so  that  the  knowledge 
generated  by  the  Project  could  be  rapidly  applied  to  basic  research  and  antimalarial  drug  and 
vaccine  development  programs  worldwide. 


Body 

This  report  describes  progress  in  the  Malaria  Genome  Sequencing  Project  achieved  by 
The  Institute  for  Genomic  Research  and  the  Malaria  Program,  Naval  Medical  Research  Center, 
under Cooperative Research Agreement DAMD17-98-2-8005, over the 12-month period from
Dec. ’98 to Dec. ’99. The specific aims of the work covered under this cooperative agreement
were  to: 

1. Determine the sequence of 3.5 megabases of the P. falciparum genome (clone 3D7):


a)  Construct  small-insert  shotgun  libraries  (1-2  kb  inserts)  of  chromosomal  DNA  isolated 
from  preparative  pulsed-field  gels. 

b)  Sequence  a  sufficiently  large  number  of  randomly  selected  clones  from  a  shotgun 
library  to  provide  10-fold  coverage  of  the  selected  chromosome. 

c) Construct P1 artificial chromosome (PAC) libraries (inserts up to 20 kb) of
chromosomal  DNA  isolated  from  preparative  pulsed-field  gels. 

d)  If  necessary,  generate  additional  STS  markers  for  the  chromosome  by  i)  mapping 
unique-sequence  contigs  derived  from  assembly  of  the  random  sequences  to  chromosome,  ii) 
mapping  end-sequences  from  chromosome-specific  PAC  clones  to  YACs. 

e)  Use  TIGR  Assembler  to  assemble  random  sequence  fragments,  and  order  contigs  by 
comparison  to  the  STS  markers  on  each  chromosome. 

f)  Close  any  remaining  gaps  in  the  chromosome  sequence  by  PCR  and  primer-walking 
using  P.  falciparum  genomic  DNA  or  the  YAC,  BAC,  or  PAC  clones  from  each  chromosome  as 
templates. 

2.  Analyze  and  annotate  the  genome  sequence: 

a) Employ a variety of computer techniques to predict gene structures and relate them to
known  proteins  by  similarity  searches  against  databases;  identify  untranslated  features  such  as 
tRNA  genes,  rRNA  genes,  insertion  sequences  and  repetitive  elements;  determine  potential 
regulatory  sequences  and  ribosome  binding  sites;  use  these  data  to  identify  metabolic  pathways 
in  P.  falciparum. 

3.  Establish  a  publicly-accessible  P.  falciparum  genome  database  and  submit 
sequences  to  GenBank. 


We  are  pleased  to  report  that  excellent  progress  has  been  made  towards  achievement  of 
these goals. In last year’s annual report we announced the publication in Science of the first
complete  sequence  of  a  malarial  chromosome  (chromosome  2)  [4].  In  addition,  we  reported  on 
work  done  by  the  TIGR/NMRC  team  and  collaborators  to  provide  new  tools  and  resources  for 
the  Malaria  Genome  Project,  including  development  of  a  Plasmodium  gene  finding  program, 
GlimmerM  [5],  and  introduction  of  optical  restriction  mapping  technology  for  rapid  mapping  of 
whole Plasmodium chromosomes [6]. We also reported that sequencing of 3 additional P.
falciparum  chromosomes  was  underway,  and  that  we  were  investigating  the  use  of  microarray 
technology  to  examine  the  expression  of  all  genes  from  chromosomes  2  and  3  of  Plasmodium. 

To  facilitate  community  access  to  the  sequence  data,  a  P.  falciparum  genome  web  site  was  also 
established  at  TIGR  which  contains  all  of  the  chromosome  2  sequence  data  and  annotation,  as 
well  as  preliminary  data  for  other  chromosomes  currently  being  sequenced 
(http://www.tigr.org/tdb/mdb/pfdb/pfdb.html). 

In  the  past  year  we  have  completed  the  high-throughput  sequencing  phase  of 
chromosomes 10, 11, and 14, which together account for 30% of the genome. These
chromosomes are now in the gap closure phase; chromosome 14 is expected to be completed
this year, and chromosomes 10 and 11 shortly thereafter. We also collaborated with
David  Schwartz’s  laboratory  in  construction  of  a  two-enzyme  optical  restriction  map  of  the 
entire P. falciparum genome; this was published recently in Nature Genetics [7]. As indicated in
last year’s report we also initiated a functional genomics program in collaboration with the
Malaria  Program,  NMRC.  Glass  slide  microarrays  containing  PCR  fragments  from  almost  all 
genes  from  chromosomes  2  and  3  have  been  prepared,  and  experiments  to  profile  the  expression 
of  these  genes  through  the  erythrocytic  stage  of  the  life  cycle  are  underway.  We  have  also 
assisted  NMRC  in  their  pilot  project  to  apply  the  techniques  of  proteomics  towards  the 
identification  of  novel  antigens  in  parasite  (sporozoite)  extracts.  Finally,  we  are  currently 
reviewing  with  NMRC  further  steps  that  can  be  taken  to  more  rapidly  apply  Plasmodium 
genomics,  functional  genomics,  and  proteomics  to  problems  of  vaccine  development  for  malaria. 


Sequencing  of  P.  falciparum  chromosome  14  (Specific  Aim  1) 

Sequencing  of  chromosome  14  (3.4  Mb)  is  being  funded  primarily  by  a  grant  from  the 
Burroughs  Wellcome  Fund;  funds  from  this  collaborative  agreement  are  being  used  to  accelerate 
the  sequencing,  assist  in  closure  and  annotation,  develop  microarrays  for  chromosome  14,  and 
facilitate  rapid  utilization  of  the  sequence  data  by  the  DoD  vaccine  and  drug  development 
groups.  In  last  year’s  report  we  described  the  isolation  of  chromosome  14  DNA  on  pulsed  field 
gels,  preparation  of  shotgun  libraries  in  pUC18,  and  high  throughput  sequencing  of  these 
libraries. The high-throughput sequencing phase of the project was completed in December
1998, yielding 74,292 sequences with an average read length of 530 nt. All of these
sequencing reactions were performed with FS+ dye terminator chemistry, which we had previously found to
be superior to dye primer chemistry for the sequencing of AT-rich P. falciparum DNA. The data set is
equivalent to 9X coverage, assuming that, due to co-migration of sheared nuclear DNA with the
chromosome 14 DNA on pulsed field gels, 20% of sequences in the shotgun library were derived
from other chromosomes. The sequences were assembled in a two-step procedure with TIGR
Assembler [8]. The first assembly was performed at 99.5% stringency to produce a robust set of
contigs; these contigs and the remaining unassembled sequences were then used to start a second
assembly at 97.5% stringency. This yielded 1,750 contigs, the largest of which was 99 kb. In
comparison,  the  largest  contig  obtained  after  the  first  assembly  of  the  chromosome  2  data  was 
about  20  kb,  indicating  that  exclusive  use  of  the  dye  terminator  chemistry  for  chromosome  14 
resulted  in  the  production  of  high  quality  sequence  data. 
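
The 9X coverage figure can be checked with a short calculation. The Python sketch below uses only the read count, average read length, contamination estimate, and chromosome size quoted in this report; the function name is ours and is purely illustrative.

```python
def shotgun_coverage(n_reads, mean_read_len_nt, target_size_nt, off_target_fraction=0.0):
    """Estimate fold coverage of a target from random shotgun reads.

    off_target_fraction is the share of reads assumed to come from
    contaminating DNA (here, chromosomes other than the target).
    """
    on_target_nt = n_reads * mean_read_len_nt * (1.0 - off_target_fraction)
    return on_target_nt / target_size_nt


# Values quoted in the text: 74,292 reads, 530 nt average read length,
# a 3.4 Mb chromosome, and 20% of reads assumed to be from other chromosomes.
coverage = shotgun_coverage(74_292, 530, 3_400_000, off_target_fraction=0.20)
print(f"Estimated coverage: {coverage:.1f}X")  # about 9.3X
```

Under these assumptions the estimate is about 9.3X, in line with the 9X stated above.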

The gap closure process began in December 1998. The procedures being used to close
gaps are essentially the same as those used previously on the chromosome 2 project [4], namely 1)
use of GROUPER software to identify groups (contigs linked by shotgun clones), physical gaps
and sequence gaps; 2) editing of contig ends to remove untrimmed vector sequence, low quality
sequence data, and chimeric clones that prevent merging of contigs; 3) resequencing of missing
mates and short sequences at contig ends; 4) sequencing of shotgun clones spanning sequence
gaps using primers at the ends of the gaps; 5) PCR with genomic DNA to span physical gaps;
and 6) use of the transposon insertion method to close very AT-rich gaps. In practice,
GROUPER is run on the set of contigs produced by an assembly and some or all of steps 1-6 are
performed until no further progress is possible. Another assembly is then performed with the
edited contigs, new sequences (e.g. primer walks and missing mates), and unassembled
sequences left over from the previous assembly, thereby providing new
starting points for additional work. This process is repeated until the sequence is closed.
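
The idea of a group (contigs linked by shotgun clones) can be illustrated with a small clustering sketch. This is not the GROUPER program, whose internals are not described here; it is a minimal union-find example in Python, and the contig names and link list are hypothetical. A link is a pair of contigs joined by a clone whose two end reads fall in different contigs. Gaps between linked contigs are sequence gaps, and gaps between groups are physical gaps. Table 1 is consistent with this view: the number of physical gaps is one less than the number of chromosome 14 groups.

```python
def group_contigs(contigs, mate_links):
    """Cluster contigs that are joined, directly or indirectly, by clone links."""
    parent = {c: c for c in contigs}

    def find(c):
        # Follow parent pointers to the group representative,
        # shortening the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in mate_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


# Hypothetical input: five contigs and three clone links.
contigs = ["c1", "c2", "c3", "c4", "c5"]
mate_links = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]

groups = group_contigs(contigs, mate_links)
print(groups)  # [['c1', 'c2', 'c3'], ['c4', 'c5']]
print("physical gaps on one linear chromosome:", len(groups) - 1)
```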

As noted above, due to cross-contamination of the chromosome 14 DNA with sheared
nuclear  DNA,  up  to  20%  of  the  sequence  data  is  derived  from  chromosomes  other  than 
chromosome  14.  In  order  to  focus  the  closure  efforts  on  chromosome  14  contigs,  chromosome  14 
markers  are  used  to  identify  which  contigs  and  groups  of  contigs  are  from  chromosome  14.  With 
chromosome  2  about  30  markers  were  available  (1  marker  per  30  kb).  In  contrast,  for 
chromosome  14  there  are  98  STS  markers  derived  from  YACs  (provided  by  Alister  Craig)  plus 
an  additional  101  SSLP  markers  [9],  providing  a  marker  about  every  17-20  kb.  The  higher 
density  of  markers  will  allow  identification  of  more  chromosome  14  contigs  and  should  simplify 
the  gap  closure  process.  In  addition,  with  funding  provided  by  the  BWF,  David  Schwartz’s  group 
has  completed  a  2-enzyme  optical  restriction  map  of  the  P.  falciparum  genome  [7].  We  will  use 
the  optical  map  and  the  chromosome  14  markers  to  determine  the  order  of  contig  groups  on  the 
chromosome;  this  should  permit  us  to  reduce  the  number  of  PCR  reactions  required  for  closure 
of  the  physical  gaps. 

To  date  we  have  performed  2  cycles  of  the  closure  procedure  on  the  chromosome  14 
contigs  and  have  nearly  completed  the  third  cycle  (Table  1).  In  the  first  cycle  between  12/98  and 
2/99,  most  of  our  efforts  were  focused  on  editing  of  the  contig  ends  and  on  performing  the 
sequencing  reactions  for  the  missing  mates  and  short  sequences  identified  at  physical  gaps.  The 
most  time  consuming  and  labor-intensive  part  of  the  process  is  editing.  Three  individuals  spent  6 
weeks  editing  the  ends  of  contigs  from  the  initial  assembly  in  order  to  remove  untrimmed  vector 
and  low-quality  sequences  that  prevented  the  merging  of  overlapping  contigs.  In  subsequent 
rounds  of  closure  we  have  re-sequenced  an  additional  1412  missing  mates  and  short  sequences 
from  sequence  gaps  and  have  performed  755  primer  walks.  Between  12/98  and  7/99  we  closed 
47%  of  the  physical  gaps  and  65%  of  the  sequence  gaps,  and  one-fourth  of  the  chromosome  is 
now  covered  by  contigs  larger  than  100  kb.  About  one-third  of  the  primer  walks  have  yet  to  be 
completed  and  additional  editing  is  underway.  Once  these  steps  are  completed  another  assembly 
will  be  performed.  We  expect  that  >  80%  of  sequence  gaps  will  have  been  closed  at  this  point. 
The  remaining  gaps  are  likely  to  be  composed  of  very  AT-rich  sequence;  closure  of  these  AT- 
rich  gaps  will  require  use  of  the  transposon  insertion  technique  that  was  used  for  closure  of  AT- 
rich  gaps  in  chromosome  2. 
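
The closure percentages quoted above can be reproduced from the gap counts in Table 1. In the Python sketch below, the function name is ours. Because physical gaps were not determined for 7/99, we assume the 47% figure is based on the 3/99 count.

```python
def percent_closed(initial_gaps, remaining_gaps):
    """Percentage of the initial gaps that are no longer open."""
    return 100.0 * (initial_gaps - remaining_gaps) / initial_gaps


# Counts from Table 1 (12/98 versus the latest available value).
physical = percent_closed(62, 33)    # 3/99 count; 7/99 was not determined
sequence = percent_closed(184, 64)   # 7/99 count is approximate (~64)

print(f"physical gaps closed: {physical:.0f}%")  # 47%
print(f"sequence gaps closed: {sequence:.0f}%")  # 65%
```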

As  shown  in  Table  1,  closure  of  physical  gaps  has  lagged  behind  closure  of  the  sequence 
gaps.  This  is  primarily  due  to  the  fact  that  most  of  our  work,  apart  from  the  use  of  database 
queries  to  identify  missing  mates  and  short  sequences  at  physical  gaps,  has  focused  on  closure  of 
the  sequence  gaps.  This  was  done  in  order  to  obtain  larger  contigs  that  could  be  placed  more 
accurately on the YAC, SSLP, and optical restriction maps of the chromosome. Once
groups of contigs have been located on the chromosome map, PCR reactions using primers from adjacent groups
can be used to close physical gaps. About one-third of the physical gaps on chromosome 2 were
closed  in  this  way.  Once  these  gaps  are  closed  the  remaining  gaps  can  be  closed  by  performing  a 
series  of  combinatorial  PCRs  using  one  primer  from  a  mapped  group  and  another  primer  from  an 
unmapped  group. 

Table 1. Progress in gap closure of P. falciparum chromosome 14.

                         12/98     2/99     3/99     7/99
Sequences               74,292   74,994   75,929   76,406
Contigs                  1,750    1,555    1,466    1,418
Largest contig (kb)         99      124      124      164
Total groups               458      293      291     ND^a
Chr 14 groups               63       37       34       ND
Cum. length (Mb)          2.99     3.49     3.45       ND
Physical gaps               62       36       33       ND
Sequence gaps              184      180      112      ~64

^a ND, not determined.


Recently,  however,  we  prepared  primers  for  all  of  the  physical  gaps  and  have  performed 
PCR  reactions  with  the  primers  and  genomic  DNA  in  order  to  span  these  gaps.  So  far,  by 
performing  PCR  reactions  with  primers  from  the  ends  of  adjacent  groups  on  the  chromosome, 
we  have  obtained  products  spanning  about  75%  of  the  physical  gaps  and  are  in  the  process  of 
sequencing  these  products.  Many  of  these  PCR  products  are  very  AT-rich  and  have  been  difficult 
to  sequence.  As  was  done  with  chromosome  2,  many  of  these  PCR  products  may  need  to  be 
cloned  and  subjected  to  the  transposon  insertion  protocol  in  order  to  obtain  good  sequence  data 
in  the  AT-rich  areas.  To  obtain  PCR  products  from  the  remaining  physical  gaps  we  have  begun  a 
combinatorial PCR procedure in which a primer from one end of a mapped group is tested in a
series of PCR reactions with primers from the ends of unmapped groups. This process has
already  generated  several  new  PCR  products  that  are  currently  being  sequenced.  We  are  also 
investigating  use  of  a  multiplex  PCR  strategy  in  which  pools  of  four  or  more  primers  are  used  in 
PCR  reactions  [10].  This  reduces  the  number  of  PCR  reactions  that  must  be  performed  during 
closure  and  has  been  very  successful  in  accelerating  closure  of  microbial  genomes.  The  AT- 
richness  of  Plasmodium  DNA  makes  multiplex  PCR  more  difficult  than  with  other  genomes,  but 
we  recently  obtained  PCR  products  for  several  physical  gaps  via  multiplexing  that  are  being 
sequenced. 
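
The saving from pooling primers can be illustrated with a simple count. The Python sketch below is a simplified counting argument of our own and is not the optimized scheme of reference [10]. It assumes that every primer pair must be tested, that primers are split into equal pools, and that each reaction combines two pools. The primer count of 64 is hypothetical.

```python
from math import ceil, comb


def pairwise_reactions(n_primers):
    """Reactions needed to test every pair of primers in its own tube."""
    return comb(n_primers, 2)


def pooled_reactions(n_primers, pool_size):
    """Reactions needed when each tube combines two pools of primers.

    Every pair of pools shares one tube, so every primer pair (within or
    across pools) meets in at least one reaction. Each tube therefore
    holds 2 * pool_size primers.
    """
    n_pools = ceil(n_primers / pool_size)
    if n_pools < 2:
        return 1 if n_primers >= 2 else 0
    return comb(n_pools, 2)


n_primers = 64  # hypothetical number of group-end primers
print(pairwise_reactions(n_primers))    # 2016 single-pair reactions
print(pooled_reactions(n_primers, 2))   # 496 reactions with four primers per tube
```

A tube that yields a product still has to be followed up to find which primer pair produced it, so the real saving is somewhat smaller than this count suggests.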

Perhaps  the  biggest  obstacle  faced  during  the  closure  process  is  sequencing  through  long 
stretches (up to 50 bp) of As or Ts. We and others have found that the sequence quality
deteriorates  rapidly  as  the  Taq  polymerase  passes  through  these  homopolymer  stretches,  such 
that  accurate  sequence  data  is  very  difficult  to  attain  in  these  regions.  These  regions  of  lower 
than  average  sequence  quality  have  the  effect  of  introducing  sequence  gaps,  which  in  this  case 
are  regions  of  DNA  for  which  good  sequence  data  cannot  be  attained.  The  solution  we  devised  in 
the  chromosome  2  project  was  to  use  the  transposon  insertion  method  to  insert  primer  binding 
sites  into  the  AT-rich  areas.  Frequently,  by  priming  the  sequencing  reaction  within  or  very  close 
to  the  homopolymer  regions,  adequate  sequence  data  could  be  obtained.  However,  this  is  a  very 
labor  intensive  process  and  entails  performing  50-100  sequence  reactions  for  every  gap  caused 
by  a  homopolymer  stretch.  To  try  to  improve  sequencing  of  these  regions,  we  are  currently 
testing  modifications  to  our  standard  sequencing  reactions,  including  changes  in  extension 
temperatures, nucleotide mixes, salt concentrations, etc. If these simple modifications improve
sequence  quality  in  the  AT-rich  regions,  the  gap  closure  process  could  be  accelerated. 
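
Homopolymer stretches of this kind are easy to locate in assembled sequence, which may help when deciding where closure effort will be needed. The Python sketch below is illustrative only; the function name and the 20-base threshold are our assumptions and are not part of the closure pipeline described here.

```python
import re


def find_at_homopolymers(seq, min_len=20):
    """Return (start, end, base) for each run of A or T of at least min_len.

    Coordinates are 0-based and end-exclusive.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (min_len, min_len))
    return [(m.start(), m.end(), m.group()[0]) for m in pattern.finditer(seq.upper())]


# Synthetic test sequence: a 25-base A run, a 12-base T run (below the
# threshold), and a 30-base T run.
seq = "GATC" + "A" * 25 + "GC" + "T" * 12 + "CA" + "T" * 30 + "G"
runs = find_at_homopolymers(seq, min_len=20)
print(runs)  # [(4, 29, 'A'), (45, 75, 'T')]
```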

Once all gaps have been closed, the sequence will be evaluated with the program
check_coverage  to  ensure  that  a)  all  regions  of  the  assembly  are  covered  by  at  least  two  shotgun 
clones,  and  b)  that  every  base  pair  in  the  sequence  has  been  sequenced  in  both  directions  with 
one  chemistry,  or  in  one  direction  with  two  chemistries.  These  criteria  ensure  that  the  sequence 
has  been  assembled  correctly  and  validate  individual  base  calls.  The  latter  criterion  is  often 
satisfied  by  performing  10%  of  the  sequence  reactions  with  dye-primer  chemistry.  However, 
given  the  frequency  of  sequence  artifacts  in  AT-rich  regions  observed  with  the  dye-primer 
chemistry,  this  may  not  be  appropriate  for  P.  falciparum.  As  we  discovered  with  chromosome  2, 
inclusion  of  sequences  containing  artifacts  in  an  assembly  inhibits  contig  formation  and  increases 
the  number  of  sequence  gaps  in  the  assembly  and  the  effort  required  to  close  them. 

Consequently, all chromosome 14 sequencing was done with dye-terminator chemistry, and late
in  the  closure  phase  the  coverage  status  of  the  assembly  will  be  assessed.  Regions  with  one- 
direction  coverage  will  be  identified,  and  additional  dye-terminator  reactions  selected  from  the 
database  will  be  performed  to  convert  as  many  as  possible  to  two-direction  coverage.  Regions 
with  one-direction  coverage  that  remain  and  which  have  unresolved  sequence  ambiguities  will 
then  be  re-sequenced  with  dye-primer  chemistry.  This  process  will  ensure  that  the  coverage 
criteria  are  satisfied  and  minimize  potential  assembly  problems  arising  from  use  of  dye-primer 
chemistry.  Finally,  the  sequence  will  be  edited  using  the  program  TIGR_Editor,  which  displays 
all  gel  reads  and  electropherograms  for  each  base  in  the  sequence.  Discrepancies  will  be  noted 
and  additional  sequencing  reactions  will  be  performed  to  resolve  ambiguities.  As  a  last  step  to 
confirm  colinearity  of  the  assembled  sequence  and  genomic  DNA,  restriction  maps  predicted 
from  the  sequence  will  be  compared  with  the  chromosome  14  optical  restriction  maps  described 
above.
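
The two coverage criteria described above can be expressed compactly. The following Python sketch is our illustration only and is not the check_coverage program; the read tuple layout, the function name, and the example reads are assumptions. It checks each position one at a time, which is adequate for a small example but not for a whole chromosome.

```python
def coverage_failures(length, reads):
    """Return the positions that fail either coverage criterion.

    reads: iterable of (start, end, strand, clone_id, chemistry), with
    0-based, end-exclusive coordinates and strand '+' or '-'.
    """
    failures = []
    for pos in range(length):
        covering = [r for r in reads if r[0] <= pos < r[1]]
        clones = {r[3] for r in covering}
        strands = {r[2] for r in covering}
        chemistries = {r[4] for r in covering}
        # (a) covered by at least two shotgun clones
        clone_ok = len(clones) >= 2
        # (b) sequenced in both directions, or in one direction
        #     with two different chemistries
        base_ok = len(strands) == 2 or len(chemistries) >= 2
        if not (clone_ok and base_ok):
            failures.append(pos)
    return failures


# Hypothetical 12-base assembly covered by three dye-terminator reads.
reads = [
    (0, 6, "+", "cloneA", "terminator"),
    (3, 10, "-", "cloneB", "terminator"),
    (8, 12, "+", "cloneC", "terminator"),
]
failed = coverage_failures(12, reads)
print(failed)  # [0, 1, 2, 6, 7, 10, 11]
```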

Elucidation of gene structure will be performed with the program GlimmerM, a
eukaryotic gene-finding program developed at TIGR specifically for the malaria genome project (see
section  below).  Before  the  annotation  of  chromosome  14  begins,  GlimmerM  will  be  refined  to 
improve  accuracy  and  the  training  set  will  be  updated  with  newly-published  sequences,  so  that  a 
more  robust  gene-finding  tool  will  be  available  once  the  sequence  is  completed.  Predicted  coding 
regions  will  be  searched  against  the  sequence  and  protein  databases  using  our  standard  methods. 
Repetitive  elements  and  other  features  will  also  be  identified  and  annotated.  Since  many  genes 
will  have  no  database  matches,  defining  the  boundaries  of  genes  will  be  challenging.  Most  of  the 
software  necessary  for  annotation  was  tested  during  the  chromosome  2  project,  and  will  require 
only a few minor modifications for use on chromosome 14. The annotation performed under this
grant will by necessity be preliminary. Our goal is to provide a starting point for further
biological characterization. We will facilitate public access to the sequence by release of
preliminary and finished sequence on the TIGR web site
(http://www.tigr.org/tdb/mdb/pfdb/pfdb.html). This will include full text- and sequence-based
searching  of  chromosomes  2  and  14,  as  well  as  links  to  other  sources  of  P.  falciparum  sequence 
data  such  as  the  Sanger  Center  and  Stanford  University.  Since  the  start  of  the  random  sequencing 
phase, raw shotgun sequences and contigs from test assemblies have been released on the TIGR
web site. Upon completion of the random phase of the project the complete set of > 74,000
shotgun sequences and the contigs from the first full assembly were placed on the web site.
These contigs have been updated approximately every 6-8 weeks as gap closure has progressed.
In addition, early this year we installed a new BLAST server that returns the BLAST output as
well as the FASTA-formatted sequence of the best hit plus 1 kb on either side. This enables
those who are unable to process the very large assembly files to retrieve the sequence of interest
without help from the sequencing center.
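
The retrieval step performed by the server (best hit plus 1 kb on either side, returned in FASTA format) can be sketched as follows. This is not the server's code; the function name, header format, and 60-column line width are assumptions chosen for illustration.

```python
def hit_with_flanks(name, contig_seq, hit_start, hit_end, flank=1000, width=60):
    """Return a FASTA record for a hit region plus flanking sequence.

    hit_start and hit_end are 0-based, end-exclusive. The window is
    clipped at the ends of the contig. The header reports 1-based
    inclusive coordinates of the returned window.
    """
    start = max(0, hit_start - flank)
    end = min(len(contig_seq), hit_end + flank)
    sub = contig_seq[start:end]
    header = f">{name}:{start + 1}-{end}"
    lines = [sub[i:i + width] for i in range(0, len(sub), width)]
    return "\n".join([header] + lines)


# Synthetic 4 kb contig with a hit at positions 1500-1600.
contig = "ACGT" * 1000
record = hit_with_flanks("demo_contig", contig, 1500, 1600)
print(record.splitlines()[0])  # >demo_contig:501-2600
```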


Sequencing  of  chromosomes  10  and  11  (Specific  Aim  1) 

Chromosomes  10  and  11,  which  together  constitute  16%  of  the  genome  (1.7  and  2.0  Mb, 
respectively), are being sequenced primarily with funding provided by the National Institute of
Allergy and Infectious Diseases (L.M. Cummings is the Principal Investigator). Funds from this
collaborative  agreement  are  being  used  to  accelerate  the  sequencing,  assist  in  closure  and 
annotation,  develop  microarrays  for  these  chromosomes,  and  facilitate  rapid  utilization  of  the 
sequence data by the DoD vaccine and drug development groups. The random phase for
chromosomes 11 and 10 was completed in mid- and late 1999, respectively, and these
chromosomes are now in closure. The closure procedure for these chromosomes will be very
similar  to  that  used  for  chromosomes  2  and  14  (see  chromosome  14  section  above),  and  will  take 
advantage  of  any  technical  improvements  that  are  produced.  One  major  difference  in  the  closure 
process, however, is that many fewer microsatellite markers are available for these
chromosomes [9, 11], making physical gap closure more difficult by reducing the number of
contigs  that  can  be  accurately  ordered  on  the  chromosome.  Consequently,  Dr.  Cummings  is 
collaborating  with  Dr.  X  Su  of  the  Laboratory  of  Parasitic  Diseases,  NIAID,  in  production  of 
additional microsatellite markers for these chromosomes. Raw sequence reads and preliminary
contigs of chromosomes 10 and 11 have been released on the TIGR web site and will be updated
periodically as closure proceeds (http://www.tigr.org/tdb/mdb/pfdb/pfdb.html).


Optical  mapping  of  P.  falciparum  chromosomes  (added  to  Specific  Aim  1) 

Last  year  we  reported  that  we  had  collaborated  with  Dr.  David  Schwartz  in  production  of 
optical  restriction  maps  for  chromosome  2.  We  demonstrated  that  the  optical  restriction  maps 
were  very  useful  for  independent  verification  of  the  final  chromosome  2  sequence.  The 
successful  application  of  the  optical  mapping  approach  to  sequence  validation  of  chromosome  2 
led  the  Malaria  Genome  Sequencing  Consortium  to  recommend  that  Dr.  Schwartz’s  laboratory 
be  funded  by  the  Burroughs  Wellcome  Fund  to  generate  two-enzyme  restriction  maps  of  the 
entire  P.  falciparum  genome,  using  high-molecular  weight  genomic  DNA  provided  by  Dr.  Dan 
Carucci,  NMRC.  A  map  of  the  complete  genome  was  recently  determined  and  has  been 
published in Nature Genetics [7]. The whole genome optical map has proven to be very useful in
the  gap  closure  process,  by  assisting  in  ordering  of  contigs  on  the  chromosome.  In  cases  where 
contigs  have  matches  to  sequence  markers,  the  optical  map  provides  confirmation  of  the  map 
position.  In  cases  where  contigs  do  not  match  any  markers,  the  maps  often  provide  a  preliminary 
localization,  which  can  be  confirmed  later  during  gap  closure.  These  optical  maps  are  being  used 
by  all  of  the  sequencing  centers  working  on  Plasmodium,  and  other  investigators  working  on 
other  parasites  are  also  beginning  to  use  these  maps  in  their  genome  sequencing  projects  (e.g. 
Trypanosoma  cruzi,  N.  El  Sayed,  TIGR,  personal  communication).  Optical  maps  may  prove  very 
useful  in  the  sequencing  of  other  malaria  parasites  such  as  P.  yoelii  and  P.  vivax,  for  which  we 
do  not  have  many  sequence  tagged  sites  or  microsatellite  markers  to  assist  in  gap  closure. 

Indeed, preliminary optical map data obtained by Dr. Carucci at NMRC suggest that P. yoelii,
contrary to expectations, may have only 13 chromosomes. We are currently working on ways to
use optical map data more efficiently in the closure process by writing software that directly
maps contigs onto the optical map. Currently this process is done manually.
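
A first step in placing a contig on an optical map is to predict the restriction fragments the contig should produce. The Python sketch below shows such an in silico digest. It is not the software under development; the function name is ours, and the site GCTAGC (cut after the first base) is used purely as an example of a palindromic recognition sequence. The enzymes actually used for the optical map are described in [7].

```python
def in_silico_digest(seq, site, cut_offset):
    """Return ordered fragment lengths from cutting a linear sequence.

    The sequence is cut at every occurrence of site; cut_offset is the
    number of bases from the start of the site to the cut position.
    For a palindromic site, searching one strand finds every site.
    """
    seq = seq.upper()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + cut_offset)
        i = seq.find(site, i + 1)
    edges = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(edges, edges[1:])]


# Synthetic 26-base sequence containing two example sites.
seq = "AAAA" + "GCTAGC" + "TTTTTTTT" + "GCTAGC" + "AA"
fragments = in_silico_digest(seq, "GCTAGC", cut_offset=1)
print(fragments)  # [5, 14, 7]
```

The ordered list of predicted fragment sizes could then be compared against runs of consecutive fragments in the optical map to find the best placement.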


Development  and  utilization  of  a  P.  falciparum  gene  finding  program  (added  to 
Specific  Aim  2) 

Last  year  we  reported  the  development  of  new  gene  finding  software,  GlimmerM,  that 
was  written  specifically  for  annotation  of  P.  falciparum  chromosome  2.  The  GlimmerM  program 
[5],  when  provided  with  a  training  set  of  well-characterized  P.  falciparum  genes,  constructs  a 
statistical  model  of  coding  sequences  and  donor  and  acceptor  splice  sites.  New,  uncharacterized 
P.  falciparum  sequence  is  then  analyzed  by  GlimmerM,  and  a  set  of  putative  gene  models  is 
produced.  These  models  are  then  evaluated  by  expert  annotators  in  conjunction  with  other 
evidence  such  as  database  matches,  the  presence  of  signal  peptides  and  transmembrane  domains, 
etc.,  to  produce  the  final  gene  models  reported  in  the  annotation.  This  year  the  GlimmerM 
software  has  been  improved  by  enlargement  of  the  training  set.  The  size  of  the  training  set  has 
been increased by adding 1) gene models from the recently published chromosome 3 sequence
[12], 2) other newly published P. falciparum sequences from GenBank, and 3) long open
reading frames from the chromosome 14 preliminary data. Since GlimmerM works by
constructing a statistical model of P. falciparum genes, enlargement of the training set should
improve  the  accuracy  of  the  gene  predictions.  Once  chromosomes  10,  11,  and  14  have  been 
completed,  the  first  step  in  the  annotation  process  will  be  to  use  the  improved  GlimmerM 
software  to  predict  gene  models  in  the  chromosome  sequence.  These  gene  models  will  then  be 
searched  against  the  protein  and  nucleotide  sequence  databases  to  identify  the  genes.  GlimmerM 
has  been  provided  to  the  other  members  of  the  sequencing  consortium  and  is  also  available  from 
the  TIGR  web  site. 
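
To illustrate in general terms how a statistical model of splice sites can be built from a training set, the Python sketch below trains a simple log-odds position weight matrix on aligned donor-site sequences. This is a toy example and is not GlimmerM's model, which is described in [5]. The training sequences are invented, and the uniform background is an assumption; an AT-rich background would be more realistic for P. falciparum.

```python
from math import log2


def train_pwm(sites, pseudocount=1.0, background=None):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    alphabet = "ACGT"
    background = background or {b: 0.25 for b in alphabet}
    pwm = []
    for pos in range(len(sites[0])):
        # Pseudocounts keep unseen bases from producing log(0).
        counts = {b: pseudocount for b in alphabet}
        for s in sites:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: log2((counts[b] / total) / background[b]) for b in alphabet})
    return pwm


def score(pwm, candidate):
    """Sum of per-position log-odds scores for a candidate site."""
    return sum(col[base] for col, base in zip(pwm, candidate))


# Invented training set of donor-like hexamers.
training_sites = ["GTAAGT", "GTAAAT", "GTATGT", "GTAAGA"]
pwm = train_pwm(training_sites)

good = score(pwm, "GTAAGT")
poor = score(pwm, "CCGGCC")
print(f"{good:.2f} {poor:.2f}")  # positive for the donor-like site, negative otherwise
```

With more training sites the per-position frequencies are estimated more reliably, which is the intuition behind enlarging the training set.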


Microarray  studies  (added  to  Specific  Aim  1) 

In  last  year’s  report  we  described  our  first  efforts  to  add  functional  genomics  studies  to 
this  P.  falciparum  genome  sequencing  project.  The  aim  of  these  functional  genomic  studies  is  to 
provide  a  more  complete  view  of  Plasmodium  biology  by  determining  gene  expression 
information for all Plasmodium genes that are identified through the genome sequencing effort.
We  chose  to  use  glass  slide  microarrays  for  this  work  [13].  Microarrays  can  be  used  to  examine 
the  expression  patterns  of  thousands  of  genes  simultaneously  from  two  or  more  RNA  samples. 
These  RNA  samples  may  be  derived  from  parasites  grown  under  different  growth  conditions,  or 
from  different  life  cycle  stages,  in  order  to  determine  the  complement  of  genes  that  may  be 
differentially  expressed  under  varying  conditions.  In  pilot  studies  conducted  at  NMRC  to 
evaluate this technology, PCR products representing virtually all genes from chromosome 2 were
prepared  and  arrayed  on  glass  slides  using  TIGR’s  Molecular  Dynamics  Arrayer  robot.  Total 
RNA  was  prepared  from  cultured  P.  falciparum  (clone  3D7)  taken  at  several  time  points.  The 
cDNA from each RNA species was differentially labelled with either dUTP-Cy3 or dUTP-Cy5
and hybridized to a DNA microarray at 65°C overnight. After several washes the DNA
microarrays were scanned using a ScanArray® 3000 dual color confocal laser system.
Fluorescence intensity measurements of each spot were made using the computer program
ImaGene™.  Analysis  of  the  data  revealed  clear  examples  of  differential  gene  expression  during 
the  erythrocytic  cycle,  which  encouraged  us  to  proceed  with  construction  of  new  microarrays 
including  all  genes  identified  in  completed  chromosome  sequences. 
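
As a rough illustration of how two-color spot intensities can be turned into expression ratios, the sketch below computes a log2(Cy5/Cy3) value per spot. It is not the ImaGene or NMRC analysis pipeline; the function name, the background and floor parameters, and the intensity values are all hypothetical.

```python
import math

def log2_ratio(cy5, cy3, background=0.0, floor=1.0):
    """Return log2(Cy5/Cy3) for one spot after subtracting a background estimate."""
    # Clamp to a small positive floor so the ratio and its log stay defined.
    red = max(cy5 - background, floor)
    green = max(cy3 - background, floor)
    return math.log2(red / green)

# Hypothetical (Cy5, Cy3) intensities for three spots.
spots = {"geneA": (4000.0, 1000.0), "geneB": (500.0, 2000.0), "geneC": (800.0, 800.0)}
ratios = {name: log2_ratio(r, g) for name, (r, g) in spots.items()}
```

Positive values would indicate higher signal in the Cy5-labelled sample, negative values higher signal in the Cy3-labelled sample.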

In  the  past  year  the  microarrays  have  been  expanded  to  include  all  genes  from 
chromosome  3  [12],  and  RNA  samples  from  6  hourly  time  points  taken  throughout  the  48  hr 
erythrocytic cycle have been prepared. A series of hybridization experiments has been
performed  with  the  chromosome  2  and  3  arrays  and  the  series  of  erythrocytic  stage  RNA  probes. 
Analysis  of  the  results  is  underway  and  will  provide  a  profile  of  gene  expression  throughout  the 
erythrocytic  cycle.  This  information  may  shed  light  on  the  function  of  the  novel  genes  identified 
on these chromosomes and identify potential blood stage antigens for vaccine development.
With  chromosome  14  now  in  the  late  closure  phase,  we  have  also  begun  to  prepare  primers  for 
amplification of all genes on chromosome 14. When the chromosome 14 PCR products are added
to  the  chromosome  2  and  3  arrays  later  this  year,  about  20%  of  Plasmodium  genes  will  be 
represented  on  the  arrays.  As  other  chromosomes  are  completed  by  TIGR/NMRC  and  other 
members  of  the  consortium,  we  will  add  these  genes  to  the  arrays  as  well. 

These experiments produce large amounts of data. To store and analyze this information
efficiently, a Sybase relational database was developed at NMRC. Sybase is used
at TIGR for storage of all genome sequence and expression data, and so was chosen by NMRC
to provide seamless integration with data generated from the chromosome 2, 10, 11 and 14
projects at TIGR. A Web interface has also been developed to facilitate data entry, tracking,
analysis and presentation.


Sequencing  of  other  Plasmodium  species  (Specific  Aims  1,2,3) 

The  primary  goal  of  the  Malaria  Genome  Sequencing  Project  was  to  sequence  the 
genome  of  P.  falciparum.  It  appears  that  within  18  months  the  random  sequencing  phase  of  this 
project  will  have  been  completed,  so  that  virtually  all  P.  falciparum  genes  will  have  at  least 
partial sequences in the databases. Complete closure of the chromosomes will undoubtedly take
longer, but malariologists will nevertheless have access to most P. falciparum genes. Even
today,  with  only  2  chromosomes  completed,  the  genome  project  has  had  a  major  effect  on 
malaria  research  [14-18]. 

A secondary goal of the project was to sequence the genome of a second species of
malaria parasite, and the question of which parasite should be chosen has generated lively
discussion within the malaria community, with some groups favoring sequencing of the human
malaria parasite P. vivax, and others advocating sequencing one of the rodent malaria parasites
that are used as model systems. The sequence of one or more species would be very useful for
comparison to P. falciparum, perhaps enabling the identification of genes that may be involved
in differences in life cycles and pathogenicity, for example. Recent discussions at the semi-annual meeting of the
malaria  genome  consortium  may  lead  to  efforts  funded  by  the  NIAID,  the  Burroughs  Wellcome 
Fund,  or  the  Wellcome  Trust,  to  do  partial  sequencing  of  several  rodent  malaria  genomes,  which 
will  provide  useful  sequence  data  to  groups  working  on  these  different  parasites  at  a  reasonable 
cost. 

The TIGR/NMRC team is currently discussing expansion of our sequencing efforts to
include P. vivax, which is of great concern as a major human pathogen, and the rodent malaria
parasite P. yoelii, which is used as a model system for vaccine development by NMRC. To date,
we have performed limited sequencing of genomic libraries from P. yoelii; these data will be
compared  to  the  P.  falciparum  genome  in  order  to  ascertain  how  useful  the  P.  yoelii  data  will  be 
for  identification  of  homologs  in  P.  falciparum.  If  successful,  it  may  be  possible  to  rapidly 
identify  P.  yoelii  homologs  of  P.  falciparum  vaccine  candidates,  and  facilitate  modeling  of 
vaccines  in  the  rodent  model.  Our  primary  interest,  however,  is  in  sequencing  of  P.  vivax,  using 
either a whole genome shotgun approach, or a BAC-based sequence tag connector strategy [19].
Unfortunately, one would strongly prefer to sequence a cloned P. vivax parasite, and no such
clones are available. We are currently trying to determine the feasibility of cloning this parasite
using in vitro culture methods or primate models. In the meantime, we have initiated a
collaboration with Dr. Thomas Wellems' lab at the Laboratory of Parasitic Diseases, NIAID, to
sequence a P. vivax YAC clone corresponding to the chloroquine resistance locus of P.
falciparum. Sequencing of this YAC will provide the Wellems lab with important data regarding
chloroquine resistance in P. vivax, and also enable us to evaluate the sequencing methods to be
used  for  this  genome.  We  have  also  arranged  a  pilot  project  for  production  of  a  P.  vivax  BAC 
library  that  will  be  required  for  sequencing  of  the  genome. 


Modifications  to  the  Specific  Aims 

Since  the  random  sequencing  phase  of  chromosomes  10,  11,  and  14  has  been  finished  and 
completion of all 3 chromosomes appears likely by the end of 2001, the TIGR/NMRC team
has  initiated  discussions  regarding  the  best  ways  to  more  efficiently  use  the  sequence 
information  to  accelerate  vaccine  development.  These  discussions  are  ongoing,  but  at  this  point  it 
is  clear  that  information  derived  from  genomics  (sequence  and  annotation),  functional  genomics 
(gene  expression  information),  and  proteomics  (protein  expression  information)  must  be  applied 
in  a  systematic  effort  to  identify  candidate  vaccine  antigens  and  to  evaluate  their  immunogenicity 
and  protective  efficacy.  In  the  next  60  days  we  expect  to  have  completed  an  outline  of  such  a 
program  and  will  take  steps  to  implement  it. 


Key  Research  Accomplishments 

1)  Chromosomes  10,  11,  and  14  of  Plasmodium  falciparum  have  completed  the  random 
sequencing  phase  (over  297,000  sequence  reactions)  and  are  now  in  the  gap  closure  phase. 

2)  Preliminary  sequence  data  for  chromosomes  10,  11,  and  14  has  been  released  on  the  TIGR 
web site (http://www.tigr.org/tdb/edb/pfdb/pfdb.html).

3)  In  collaboration  with  Dr.  David  Schwartz,  an  optical  restriction  map  of  the  P.  falciparum 
genome  was  determined. 

4) The GlimmerM gene finding software has been improved by enlargement of the set of
sequences used for training of the software.

5) Microarrays containing PCR fragments representing virtually all chromosome 2 and 3 genes
have been prepared and pilot studies to establish labeling, hybridization, and detection
protocols have been completed. A Sybase relational database for storage and analysis of
microarray data has been constructed. Experiments to characterize gene expression in
erythrocytic parasites are underway.

6)  Small-scale  projects  to  sequence  portions  of  the  P.  yoelii  and  P.  vivax  genomes  have  been 
initiated. 


Reportable  Outcomes 

1) Salzberg, S. L., Pertea, M., Delcher, A., Gardner, M. J. & Tettelin, H. Interpolated Markov
models  for  eukaryotic  gene  finding.  Genomics  59,  24-31  (1999). 

2) Jing, J., Aston, C., Zhongwu, L., Carucci, D. J., Gardner, M. J., Venter, J. C. & Schwartz, D.
C.  Optical  mapping  of  Plasmodium  falciparum  chromosome  2.  Genome  Research  9,  175-181 
(1999). 

3) Gardner, M. J., Tettelin, H., Carucci, D. J., Cummings, L. M., Smith, H. O., Fraser, C. M.,
Venter,  J.  C.  &  Hoffman,  S.  L.  The  malaria  genome  sequencing  project:  complete  sequence 
of  P.  falciparum  chromosome  2.  Parassitologia  41,  69-75  (1999). 

4) Gardner, M. J. Invited presentation: The malaria genome project; sequencing of P.
falciparum chromosome 2, Universidad de Puerto Rico, Recinto de Ciencias Medicas, San
Juan, Puerto Rico, (1999).

5)  Gardner,  M.  J.  Invited  presentation:  Sequencing  of  microbial  genomes  and  the  implications 
for  vaccine  development,  6th  Annual  IBC  Conference  of  Vaccine  Technologies,  Arlington, 
VA,  (1999). 

6) Gardner, M. J. Invited presentation: Microbial genome sequencing and vaccine
development, Society for Industrial Microbiology, Arlington, VA, (1999).

7) Gardner, M. J. Invited presentation: Parasite and fungal genomics at TIGR, Department of
Chemical  Engineering,  Johns  Hopkins  University,  Baltimore,  MD,  (1999). 

8)  Gardner,  M.  J.  Invited  presentation:  Microbial  genome  sequencing  and  vaccine  development, 
Society  for  Industrial  Microbiology,  Arlington,  VA,  (1999). 

9)  Gardner,  M.  J.  Invited  presentation:  Malaria  research  after  the  genome  project,  British 
Society  of  Parasitology  Malaria  Meeting,  Imperial  College,  London,  (1999). 

10)  Patent  application.  Chromosome  2  sequence  of  the  human  malaria  parasite  Plasmodium 
falciparum  and  proteins  of  said  chromosome  useful  in  antimalaria  vaccines  and  diagnostic 
reagents.  Filed  by  NMRC. 


Conclusions 

The  objectives  of  this  5-year  Cooperative  Agreement  between  TIGR  and  the 
Malaria  Program,  NMRC,  were  to:  Specific  Aim  1,  sequence  3.5  Mb  of  P.  falciparum  genomic 
DNA; Specific Aim 2, annotate the sequence; Specific Aim 3, release the information to the
scientific  community.  To  date,  we  have  published  the  first  complete  sequence  of  a  malarial 
chromosome  (chromosome  2  [4]),  completed  the  random  phase  sequencing  of  3  other  large 
chromosomes  totaling  7.2  Mb,  and  have  initiated  functional  genomics  studies  using  glass  slide 
microarrays to characterize the expression of chromosome 2, 3, and 14 genes throughout the
erythrocytic  cycle.  We  have  also  collaborated  in  the  construction  of  a  two-enzyme  optical 
restriction  map  of  the  entire  P.  falciparum  genome  [7],  and  are  continuing  to  further  develop  the 
GlimmerM  gene  finding  software  developed  in  year  1.  In  addition,  we  have  begun  small  scale 
sequencing  of  the  rodent  malaria  P.  yoelii  and  are  collaborating  in  the  sequencing  of  part  of  a  P. 
vivax  chromosome.  Discussions  with  the  Malaria  Program,  NMRC  aimed  at  development  of  a 
program  to  use  genomics  and  functional  genomics  to  accelerate  vaccine  research  are  in  progress. 

Computational gene finding research has emphasized the development of gene finders for
bacterial and human DNA. This has left genome projects for some small eukaryotes without a
system that addresses their needs. This paper reports on a new system, GlimmerM, that was
developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene
density in P. falciparum is relatively high, the system design was based on a successful
bacterial gene finder, Glimmer. The system was augmented with specially trained modules to
find splice sites and was trained on all available data from the P. falciparum genome.
Although a precise evaluation of its accuracy is impossible at this time, laboratory tests
(using RT-PCR) on a small selection of predicted genes confirmed all of those predictions.
With the rapid progress in sequencing the genome of P. falciparum, the availability of this
new gene finder will greatly facilitate the annotation process. © 1999 Academic Press


1.  INTRODUCTION 

The  gene  finding  research  community  has  focused 
considerable  effort  on  human  and  bacterial  genome 
sequence  analysis.  This  is  not  surprising  given  the 
attention  paid  to  both  areas.  The  Human  Genome 
Project  has  produced  many  millions  of  nucleotides  of 
sequence,  and  the  importance  of  rapidly  identifying  the 
genes  in  this  sequence  cannot  be  overstated.  This  task 
is  made  difficult  by  the  fact  that  only  1  to  3%  of  human 
genomic  sequence  is  estimated  to  code  for  proteins.  On 
the  bacterial  side,  20  complete  bacterial  and  archaeal 
genomes  have  already  been  published,  with  dozens 
more  expected  in  the  next  2  years.  Gene  finders  for 
these prokaryotes have an advantage in that approximately
90% of the DNA of these genomes is coding;
thus the task reduces in many cases to choosing between
competing reading frames. On the other hand,
the  demand  for  accuracy  is  correspondingly  much 
higher  in  the  prokaryotic  world. 

1  To  whom  correspondence  should  be  addressed.  Telephone:  (301) 
315-2537.  Fax:  (301)  838-0209.  E-mail:  salzberg@tigr.org. 

In between these two genomic worlds lies a vast array of eukaryotic organisms whose
genomes range in size from that of a large prokaryote (on the order of tens of millions of
nucleotides) to those that are larger than human (billions of nucleotides). Their gene
density tends to be much lower than that of bacteria, but many organisms have a much higher
gene density than humans. For example, the genome of the eukaryote Saccharomyces cerevisiae
has approximately one gene every 5 kb. This corresponds to a gene density of 20%.
Recently, chromosome 2 of the malaria parasite Plasmodium falciparum was completed
(Gardner et al., 1998), and this organism too has a gene density of 20%. The remaining 13
chromosomes from malaria should be completed over the course of the next few years. The
much larger (120 million nucleotides) genome of Arabidopsis thaliana, which also is
expected to have a gene density of approximately 20%, should be completed in the same time
frame, and many projects are under way to sequence other small eukaryotes.

Because  of  their  relatively  high  gene  density  with 
respect  to  human  DNA,  using  a  gene  finder  developed 
for  human  sequence  (or  other  organisms  with  low  gene 
density,  including  most  vertebrates  and  larger  plant 
genomes) may not be the optimal approach for P. falciparum
and other small eukaryotes. Prokaryotic gene
finders  are  not  well  suited  to  this  task  because  of  their 
inability  to  handle  introns.  It  is  possible  to  retrain 
human  gene  finders  using  different  data  (for  example, 
GENSCAN  (Burge  and  Karlin,  1997)  has  been  trained 
with  Arabidopsis  data),  but  one  still  runs  the  risk  that 
because  these  systems  have  been  optimized  to  find 
genes  in  DNA  that  is  only  3%  coding,  they  may  miss 
many  genes  in  genomes  such  as  P.  falciparum. 

This paper describes a gene finder developed specifically for small eukaryotes with a
gene density of around 20%. This system, GlimmerM, was built and trained using data from
P. falciparum, the malaria parasite. It was then used as the principal gene finder for
chromosome 2 of P. falciparum, which contains 210 genes (209 protein coding genes plus one
tRNA) (Gardner et al., 1998). Most of these genes were found by GlimmerM, and as described
below, some predictions were confirmed by additional laboratory experiments.

The  basis  of  GlimmerM  is  a  dynamic  programming 
algorithm  that  considers  all  combinations  of  possible 
exons  for  inclusion  in  a  gene  model  and  chooses  the 
best  of  these  combinations.  Dynamic  programming 
(DP)  has  been  the  basis  of  many  successful  eukaryotic 
gene  finders.  Hidden  Markov  model  (HMM)  systems 
use  a  DP  algorithm  called  Viterbi  that  is  a  special  case 
of  the  algorithm  here;  these  HMM  methods  include 
VEIL  (Henderson  et  al.,  1997);  GENSCAN  (Burge  and 
Karlin,  1997),  which  uses  semi-Markov  HMMs;  and 
Genie  (Kulp  et  al.,  1996),  which  uses  generalized 
HMMs.  Very  recently,  Wirth  (1998)  described  a  gene 
finder  for  P.  falciparum  based  on  generalized  HMMs, 
but  it  is  not  yet  available  for  comparison.  The  Morgan 
system (Salzberg et al., 1996, 1998a) uses a DP algorithm
in combination with a decision tree program, and
GeneParser (Snyder and Stormo, 1995) uses DP combined
with a neural network program. These latter two
DP  formulations  are  most  similar  to  the  formulation 
used  for  GlimmerM. 

2.  METHODS  AND  ALGORITHMS 

The phrase “gene model” will be used to denote a particular combination
of exons and introns that the system is considering as a
possible gene. The decision about what gene model is best is a
combination of the strength of the splice sites and the score of the
exons produced by an interpolated Markov model (IMM). The methods
for producing the IMM and splice site scores are described next,
followed  by  the  description  of  the  dynamic  programming  algorithm 
that  uses  these  scores. 

2.1.  Interpolated  Markov  Models 

Markov chains are a family of methods for computing the probability
of an event based on a fixed number of previous events. (More
formally, a Markov chain is a sequence of random variables $X_i$, where
the probability distribution for each $X_i$ depends only on $X_{i-1}, \ldots, X_{i-k}$
for some constant $k$.) In the context of DNA sequence analysis,
Markov chains predict a base by examining a fixed number of bases
just prior to that base in the sequence. The most common type of
Markov chain is a fixed-order chain, in which the number of previous
bases to examine is a constant. For example, a fifth-order Markov
chain will predict a base by looking at the five previous bases.
Markov chains, and fifth-order chains in particular, have proven to
be effective at gene prediction in bacterial genomes (Borodovsky and
McIninch, 1993; Borodovsky et al., 1995).
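
To make the fixed-order idea concrete, the following minimal sketch trains a k-th order chain by counting which base follows each k-base context, then scores a sequence by summing log probabilities. The function names, the add-one pseudocount, and the toy training string are illustrative assumptions, not part of any published gene finder.

```python
import math
from collections import defaultdict

def train_markov(seqs, k):
    """Count (context, next base) occurrences for a k-th order chain."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts

def base_probability(counts, context, base, pseudocount=1.0):
    """P(base | context), smoothed with a pseudocount over the four bases."""
    ctx = counts.get(context, {})
    total = sum(ctx.values()) + 4 * pseudocount
    return (ctx.get(base, 0) + pseudocount) / total

def log_score(counts, seq, k):
    """Sum of log P(base | k previous bases) over the sequence."""
    return sum(math.log(base_probability(counts, seq[i - k:i], seq[i]))
               for i in range(k, len(seq)))

# Toy example with k = 2; a fifth-order chain would use k = 5.
counts = train_markov(["ATGATGATGATG"], k=2)
p_g = base_probability(counts, "AT", "G")   # (4 + 1) / (4 + 4) = 0.625
```

Contexts never seen in training fall back to a uniform 0.25 under this smoothing, which hints at why rare long contexts are unreliable predictors.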

IMMs  are  a  generalization  of  fixed-order  Markov  chains.  The  main 
distinction  is  that  rather  than  deciding  in  advance  how  many  bases 
to consider for each prediction, these models will use varying numbers
of bases for each prediction. In some contexts they will use 5
bases,  while  in  others  they  might  use  6  or  more  bases,  and  in  yet 
other  cases  they  may  use  4  or  fewer  bases.  This  allows  IMMs  to  be 
sensitive  to  how  common  a  particular  oligomer  is  in  a  given  genome. 
In  a  given  genome,  many  5-mers  might  occur  rarely  and  should  not 
be  used  for  prediction;  here  the  IMM  will  fall  back  on  a  shorter 
Markov  chain.  On  the  other  hand,  certain  8-mers  may  occur  very 
frequently,  and  for  those  the  IMM  can  use  this  longer  context  and 
make  a  better  prediction.  In  addition,  the  IMM  can  combine  the 
evidence  from  the  eighth-order  Markov  chain  and  the  fifth-order 
chain  in  such  cases.  Thus  it  has  all  the  information  available  to  a 
fifth-order  chain  plus  additional  information.  It  is  also  worth  noting 
that  both  IMMs  and  fifth-order  Markov  chains  should  outperform 
methods based on codon usage statistics. (Cf. Saul and Battistutta
(1988), a codon usage method specific to P. falciparum. Note that at
the time of that work, much less Plasmodium data were available,
and higher-order statistics might have been inaccurate as a result.)

IMMs form the basis of the Glimmer system for finding genes in
bacteria and archaea (Salzberg et al., 1998b). Glimmer correctly
identifies approximately 98% of the genes in bacteria without any
human intervention and with a very limited number of false-positives.
It has been used as the gene finder for Borrelia burgdorferi
(Fraser et al., 1997), Treponema pallidum (Fraser et al., 1998),
Chlamydia trachomatis (Stephens et al., 1998), Thermotoga maritima
(Nelson et al., submitted for publication), and others. Based on
the success of Glimmer in bacterial sequence annotation, we thought
that IMMs should make a good foundation for eukaryotic gene finding.
This is particularly true of small eukaryotes like P. falciparum
in which the gene density is intermediate between that of
prokaryotes and higher eukaryotes.

Details of how to construct an IMM for sequence data can be found
in the original Glimmer publication (Salzberg et al., 1998b);
GlimmerM uses the same IMM algorithm as that described there. In brief,
GlimmerM builds IMMs from a set of DNA sequences chosen for
training. For coding regions, it builds three separate IMMs, one for
each codon position. (This is known as a 3-periodic Markov model
(Borodovsky and McIninch, 1993).) These IMMs include zeroth-
through  eighth-order  Markov  chains,  as  well  as  weights  computed 
for  every  oligomer  of  8  bases  or  less  that  appears  in  the  training  data. 
These  weights  and  Markov  models  are  interpolated  to  produce  a 
score  for  each  base  in  any  potential  coding  sequence.  The  logs  of 
these  scores  are  summed  to  score  each  coding  region. 
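
The interpolation idea can be sketched as follows. This is a simplified illustration only: the weighting rule used here (weight grows linearly with how often a context was seen, capped at 1) and the `threshold` value are assumptions made for the sketch, not the weights computed by Glimmer, which are described in the original Glimmer publication. A 3-periodic model would keep three such count tables, one per codon position.

```python
from collections import Counter

BASES = "ACGT"

def train_counts(seqs, max_order):
    """Count every oligomer of length 1..max_order+1 in the training sequences."""
    counts = Counter()
    for seq in seqs:
        for n in range(1, max_order + 2):
            for i in range(len(seq) - n + 1):
                counts[seq[i:i + n]] += 1
    return counts

def imm_probability(counts, context, base, threshold=400):
    """Blend estimates from order 0 up to len(context); frequent contexts get more weight."""
    total = sum(counts[b] for b in BASES)
    prob = (counts[base] + 1) / (total + 4)            # smoothed order-0 estimate
    for k in range(1, len(context) + 1):
        ctx = context[-k:]
        seen = sum(counts[ctx + b] for b in BASES)     # times ctx was followed by any base
        if seen == 0:
            break                                      # longer contexts are unseen as well
        weight = min(1.0, seen / threshold)            # assumed weighting rule (hypothetical)
        prob = weight * counts[ctx + base] / seen + (1 - weight) * prob
    return prob

counts = train_counts(["ATGATGATGATG"], max_order=2)
probs = {b: imm_probability(counts, "AT", b, threshold=4) for b in BASES}
```

Because each step is a convex combination of two probability distributions, the interpolated values still sum to 1 over the four bases.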

2.2.  Splice  Site  Identification 

The  approach  used  by  GlimmerM  to  determine  the  splice  sites  is 
similar  to  that  used  in  the  Morgan  human  gene  finding  system 
(Salzberg et al., 1998a). A second-order Markov chain model is used
to score a 16-base region around donor sites and a 29-base region
around acceptor sites. For both donor and acceptor sites in P. falciparum,
a wide range of different regions were tested, and these sizes
performed best. Two second-order Markov models were built for each
type of site. First, a “true” Markov model was created from existing
data on known 5' and 3' consensus sites. These data were collected
by exhaustively combing the literature for every documented exon-intron
boundary. A “false” Markov model was built from a large
number  of  randomly  chosen  false  splice  sites,  i.e.,  sequences  that 
contained  the  consensus  GT  or  AG  dinucleotide  but  that  were  not 
true splice sites. The score of a site $s_i, s_{i+1}, \ldots, s_j$ was computed by
each Markov model according to the formula

$$S(i, j) = \sum_{k=i}^{j} M_{s_k},$$

where

$$M_{s_k} = \ln\left(\frac{f((s_{k-2}, s_{k-1}, s_k), k)}{f((s_{k-2}, s_{k-1}), k-1)}\right),$$

and f(s, k) is the frequency of substring s ending at location k. Note
that  for  the  leftmost  position  in  the  splice  site  region,  M  is  taken  to 
be  the  probability  given  by  the  zeroth-order  Markov  model,  and  for 
the  second  position,  M  is  given  by  the  first-order  model.  The  score  for 
a  given  splice  site  is  computed  by  taking  the  difference  of  the  scores 
obtained  from  the  true  site  Markov  model  and  the  false  site  model. 
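
The scoring scheme described above can be sketched in a few lines. This is an illustrative reading of the formula, not the GlimmerM source: the function names, the pseudocount used to avoid zero frequencies, and the 6-base toy windows (the text uses 16- and 29-base windows) are assumptions.

```python
import math
from collections import Counter

def train_site_model(sites):
    """Count substrings of length 1-3 ending at each position of aligned site windows."""
    freq = Counter()
    for s in sites:
        for k in range(len(s)):
            for n in (1, 2, 3):
                if k - n + 1 >= 0:
                    freq[(s[k - n + 1:k + 1], k)] += 1
    return freq, len(sites)

def site_log_prob(model, s, pseudo=0.5):
    """Sum of ln(f(context + base, k) / f(context, k - 1)) over the window."""
    freq, n_sites = model
    total = 0.0
    for k in range(len(s)):
        order = min(k, 2)   # zeroth-order at k=0, first-order at k=1, then second-order
        num = freq[(s[k - order:k + 1], k)] + pseudo
        den = (freq[(s[k - order:k], k - 1)] if order else n_sites) + 4 * pseudo
        total += math.log(num / den)
    return total

def splice_score(true_model, false_model, s):
    """Difference between the true-site and false-site model scores."""
    return site_log_prob(true_model, s) - site_log_prob(false_model, s)

# Hypothetical toy windows centred on a GT dinucleotide.
true_model = train_site_model(["AAGTAA", "CAGTAA", "AAGTAT"])
false_model = train_site_model(["TTGTCC", "CCGTTC", "GCGTCC"])
```

A window resembling the true training sites scores above zero, and one resembling the false sites scores below zero.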

After building the models, we scored all the true splice sites and a
large selection of randomly chosen false sites. We then set minimum
cut-off scores to identify correctly most (or all) true sites and measured
how many false-positives we would expect with various thresholds.
The splice sites for training the Markov models were taken from
the 119 genes (described under Results and Discussion) used to train
the IMMs, all of which had laboratory evidence to support them.
These genes contained only 81 introns in total, which did not generate
enough data to produce a very reliable second-order Markov
model. Therefore, after an initial training pass using the 81 introns,
we used GlimmerM itself to predict additional introns in chromosome 2,
selected the best of these, and added them to the training set.
Of course this is a “circular” training protocol, but this represents our
attempt to squeeze the best performance we could from limited data.
As the sequencing of the remaining chromosomes continues, and as
ESTs yield further hard evidence on introns, the available pool of
reliable data for training the splice site models should grow dramatically.
Alignments with protein sequences from other organisms will
provide additional evidence about intron locations. The Markov
chain models will consequently improve in accuracy. We intend to
continue retraining these models as the genome sequencing
progresses.

FIG. 1. Trade-off between false-positive rates and false-negative rates for the Markov chain method that recognizes exon-intron splice
sites. Data represent the accuracy on sites annotated in chromosome 2 of P. falciparum.

Figure 1 shows the trade-off between sensitivity and selectivity for the
Markov chain splice site recognition function on the 143
donor and acceptor sites in chromosome 2 of P. falciparum. Acceptor sites are much
easier  to  recognize:  with  a  false-negative  rate  of  0%  (corresponding  to 
a  sensitivity  of  100%,  meaning  that  all  true  sites  will  be  recognized), 
the  false-positive  rate — the  percentage  of  AG  dinucleotides  that  will 
incorrectly  be  called  acceptor  sites — is  just  0.56%.  For  donor  sites,  a 
0% false-negative rate corresponds to a rather high 6.1% false-positive
rate. Setting the system so that it misses 4 of the 143 (2.8%)
donor  sites  in  chromosome  2  would  reduce  this  false-positive  rate  to 
2.9%.  The  Markov  thresholds  used  here  are  set  so  that  no  true  splice 
sites  will  be  missed. 
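
The threshold trade-off can be illustrated with a short sketch that measures false-negative and false-positive rates at a given cut-off and picks the strictest cut-off that misses no true site. The function names and the score lists are hypothetical; the reported 0.56%, 6.1%, and 2.9% rates come from the chromosome 2 data, not from this example.

```python
def rates_at_threshold(true_scores, false_scores, threshold):
    """Return (false-negative rate, false-positive rate) when sites scoring >= threshold are accepted."""
    fn = sum(s < threshold for s in true_scores) / len(true_scores)
    fp = sum(s >= threshold for s in false_scores) / len(false_scores)
    return fn, fp

def zero_miss_threshold(true_scores):
    """Highest cut-off that still accepts every true site."""
    return min(true_scores)

# Hypothetical splice-site scores.
true_scores = [2.0, 3.5, 1.0, 4.2]
false_scores = [-1.0, 0.5, 1.2, -3.0, 0.0]
cutoff = zero_miss_threshold(true_scores)
fn_rate, fp_rate = rates_at_threshold(true_scores, false_scores, cutoff)
```

Raising the cut-off above `zero_miss_threshold` trades missed true sites for fewer false positives, which is the trade-off plotted for donor and acceptor sites.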

2.3.  Dynamic  Programming 

GlimmerM’s  use  of  dynamic  programming  allows  it  to  prune  out 
a  large  number  of  possible  exon-intron  combinations  and  focus 
its  analysis  only  on  relatively  high-scoring  combinations  (called 
“parses”).  The  input  to  the  algorithm  is  any  genomic  DNA  sequence 
in  FASTA  format;  small  sequences  as  well  as  entire  chromosomes 
can  be  input.  The  output  is  a  partitioning  of  the  DNA  into  coding 
regions  interleaved  with  noncoding  regions,  on  both  the  main  and 
the  complementary  strands  of  the  sequence. 

As in many other gene finders (Salzberg, 1998), there are a number
of assumptions used by GlimmerM when predicting genes in the
DNA sequence. The main assumptions are (1) the coding region of
every  gene  begins  with  a  start  codon  ATG,  (2)  a  gene  has  no  in-frame 
stop  codons  except  the  very  last  codon,  and  (3)  each  exon  is  in  a 
consistent  reading  frame  with  the  previous  exon.  These  constraints 
significantly  enhance  the  efficiency  of  computing  the  optimal  gene 
models,  by  restricting  the  search  space  of  the  DP  algorithm.  On  the 
other  hand,  genuine  frameshifts  cannot  be  detected  by  the  system. 

The dynamic programming algorithm fills in a structure Parse, in
which each element Parse[t, n, S] denotes the optimal parse of the
subsequence that begins at location n and ends at the stop codon at
location S. The variable t specifies the type of signal at n, which can
be donor, acceptor, start (codon), or stop (codon). More specifically,
Parse is an ordered list of labeled positions indicating the end-points
of a set of exons. For example,

Parse[start, 100, 540]

= (start, 100), (donor, 240), (acceptor, 380), (stop, 540)

indicates a pair of exons at positions [100 ... 239] and [380 ... 539].
A complete gene model is represented as a list Parse[start, n, S].
Other elements are partial parses, beginning at a location of type t
(t ≠ start) and ending at a stop codon S.
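
As a small illustration of this representation, the sketch below converts a parse (an ordered list of labeled positions) into exon coordinates. The tuple encoding and the function name are assumptions made for the example.

```python
def exons_from_parse(parse):
    """Pair each start/acceptor with the following donor/stop; an exon ends one base before the closing signal."""
    exons = []
    open_pos = None
    for kind, pos in parse:
        if kind in ("start", "acceptor"):
            open_pos = pos
        elif kind in ("donor", "stop") and open_pos is not None:
            exons.append((open_pos, pos - 1))
            open_pos = None
    return exons

parse = [("start", 100), ("donor", 240), ("acceptor", 380), ("stop", 540)]
exons = exons_from_parse(parse)   # [(100, 239), (380, 539)]
```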

The DP algorithm processes the input sequence left to right, looking
for stop codons. At each stop codon S, it searches back in the 5'
direction and finds all possible genes ending at that stop. It chooses
the highest scoring gene to store in Parse. More concisely,

$$\mathrm{Parse}[t, n, S] = (t, n),\ \mathrm{Parse}[t_{\mathrm{next}}, i, S],$$

where i is the location that achieves the maximum score

$$\max_{n < i < S} \{\mathrm{Score}((t, n),\ \mathrm{Parse}[t_{\mathrm{next}}, i, S])\},$$

and $t_{\mathrm{next}}$ is the type logically following the type t in a parse. For
example, if t = acceptor, then $t_{\mathrm{next}}$ can be either donor or stop.
Score(Parse[t, n, S]) is the score given by the IMMs to the coding region
obtained by concatenating all the exons in the parse delimited by
Parse[t, n, S]. For example, if n is an acceptor site, the algorithm
considers all sites i that can follow n and chooses the best one. These
would include donor sites, if n is the beginning of an internal exon,
and stop codons, if n is the final coding exon. Because the algorithm
works backward from each stop codon S, the entry Parse[$t_{\mathrm{next}}$, i, S] is
computed prior to Parse[t, n, S]. The only positions that are considered
as possible donor and acceptor sites are those that score above
the threshold determined by the Markov chains described previously.
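
The recurrence can be sketched as a memoized search over signal positions. This is a simplified illustration under stated assumptions, not GlimmerM itself: exon scores are assumed to be additive (the paper scores the concatenated coding region with IMMs), reading-frame and in-frame stop checks are omitted, and `best_gene`, `toy_exon_score`, and the signal list are hypothetical.

```python
from functools import lru_cache

# Signal types that may logically follow each type in a parse.
NEXT_TYPES = {"start": ("donor", "stop"), "acceptor": ("donor", "stop"), "donor": ("acceptor",)}

def best_gene(signals, stop, exon_score):
    """Return (score, parse) for the best gene model ending at `stop`, or None."""
    signals = sorted(set(signals), key=lambda s: s[1])

    @lru_cache(maxsize=None)
    def parse(kind, pos):
        # Best partial parse beginning at (kind, pos) and ending at `stop`.
        best = None
        for nxt_kind, nxt_pos in signals:
            if nxt_pos <= pos or nxt_pos > stop or nxt_kind not in NEXT_TYPES[kind]:
                continue
            if nxt_kind == "stop":
                if nxt_pos != stop:
                    continue
                tail = (0.0, (("stop", stop),))
            else:
                tail = parse(nxt_kind, nxt_pos)
                if tail is None:
                    continue
            # A start or acceptor opens an exon [pos, nxt_pos - 1]; introns contribute no score.
            gain = exon_score(pos, nxt_pos - 1) if kind != "donor" else 0.0
            cand = (gain + tail[0], ((kind, pos),) + tail[1])
            if best is None or cand[0] > best[0]:
                best = cand
        return best

    models = [parse("start", p) for k, p in signals if k == "start" and p < stop]
    models = [m for m in models if m is not None]
    return max(models, key=lambda m: m[0]) if models else None

def toy_exon_score(begin, end):
    """Hypothetical coding score: +1 per base, -2 per base inside 240..379."""
    return sum(-2 if 240 <= p < 380 else 1 for p in range(begin, end + 1))

signals = [("start", 100), ("start", 150), ("donor", 240), ("acceptor", 380), ("stop", 540)]
best = best_gene(signals, 540, toy_exon_score)
```

With this toy scorer the two-exon model starting at position 100 wins, because reading through the low-scoring region as a single exon is penalized.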

The algorithm incorporates special cases for each of the four types t to prune the search space further. These are as follows:

1. If the interval (n ... i) is the coding portion of an exon, its IMM score must exceed a fixed, preset threshold.

2. If two internal exons (n ... i1) and (n ... i2) both result in identical IMM scores, choose the one that maximizes the length of the coding part of the parse. Note that this rule makes GlimmerM prefer longer gene models.

3. If (n ... i) is an intron, then its AT content must be at least 70%. This constraint is based on the observation that all P. falciparum introns in the training set had an AT content of above 70%, with only 1% of introns having an AT content under 75%. In contrast, P. falciparum exons have an AT content of 70-75%.

4. The length of an intron must be between 50 and 1500 bp; 73 and 1066 bp were the extreme lengths for the introns in the training set.

5. The total length of the coding portions of a gene model represented in Parse[start, n, S] must be greater than 200 bp.

6. If n is a stop codon, the algorithm searches backward for all gene models ending at n. Many stop codons can be quickly eliminated because they follow too closely another stop codon in the same reading frame. There is no way to create a gene model ending at these stops, because any gene ending there would be too short. The high AT content of P. falciparum and the resulting high frequency of stop codons make this step particularly effective.
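Rules 3 to 5 are simple threshold tests and can be written down directly. The sketch below is illustrative only: the function and constant names are assumptions made here, the thresholds are the ones quoted in the list, and the IMM-based rules (1 and 2) and the stop-codon elimination (rule 6) are not shown.

```python
MIN_INTRON_AT = 0.70        # rule 3: intron AT content of at least 70%
MIN_INTRON_LEN = 50         # rule 4: intron length between 50 and 1500 bp
MAX_INTRON_LEN = 1500
MIN_CODING_LEN = 200        # rule 5: total coding length must exceed 200 bp


def at_content(seq):
    """Fraction of A and T bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("A") + seq.count("T")) / len(seq)


def intron_allowed(seq):
    """Apply the intron length and AT-content constraints."""
    return (MIN_INTRON_LEN <= len(seq) <= MAX_INTRON_LEN
            and at_content(seq) >= MIN_INTRON_AT)


def gene_model_allowed(exon_lengths):
    """Apply the minimum total coding length constraint."""
    return sum(exon_lengths) > MIN_CODING_LEN


print(intron_allowed("AT" * 30 + "GC" * 5))   # 70 bp, AT-rich
print(intron_allowed("ATGC" * 25))            # 100 bp, only 50% AT
print(gene_model_allowed([120, 90]))          # 210 bp of coding sequence
```

Because each test needs only a length or a base count, candidate introns and gene models can be discarded before any IMM score is computed.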

An attempt was made to use IMMs to score introns as well as exons, but this did not improve the results. Therefore, when t is a donor site and t_next is an acceptor, we have

$$\mathrm{Score}\big((\mathrm{donor}, n),\ \mathrm{Parse}[\mathrm{acceptor}, i, S]\big) = \mathrm{Score}\big(\mathrm{Parse}[\mathrm{acceptor}, i, S]\big).$$

The algorithm is run separately on both the direct and the complementary strands of the input. GlimmerM then makes one more pass over the list of putative genes to reject overlapping genes. If genes overlap by less than a fixed amount (30 bp by default), then the overlap is ignored, and both genes are reported in the output. Most overlapping genes are competing gene models that share a stop codon and have different exon locations. Genes that overlap by more than 30 bp are rescored using the IMM, and the gene with the best score is retained. If the scores of two or more overlapping models differ from the maximum score by less than a small preset amount, then GlimmerM considers the scores equivalent and outputs all the models as possible genes. In these instances, it marks the longest gene as the preferred model.
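The overlap pass can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: genes are reduced to a single interval with a precomputed score (the rescoring by the IMM is not modeled), conflicts are resolved greedily from the highest score down, an overlap of exactly 30 bp is treated as ignorable, and the handling of near-equal scores is left out. The `Gene` class and `resolve_overlaps` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    start: int    # inclusive coordinate
    end: int      # exclusive coordinate
    score: float  # stands in for the IMM rescoring


def overlap(a, b):
    """Number of bases shared by two genes (0 if they are disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def resolve_overlaps(genes, max_overlap=30):
    """Keep the best-scoring gene among those overlapping by more than max_overlap."""
    kept = []
    for g in sorted(genes, key=lambda g: g.score, reverse=True):
        if all(overlap(g, k) <= max_overlap for k in kept):
            kept.append(g)
    return sorted(kept, key=lambda g: g.start)


a = Gene(0, 1000, 5.0)
b = Gene(980, 2000, 4.0)   # overlaps a by 20 bp: both are reported
c = Gene(500, 1500, 9.0)   # overlaps a and b heavily and outscores them
print(resolve_overlaps([a, b]))
print(resolve_overlaps([a, b, c]))
```

A greedy pass is only one way to realize the rule described; it has the convenient property that the top-scoring model in any cluster of competing models is always retained.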

2.4. Code Availability

The complete GlimmerM system is available from the authors; it has already been shared with other malaria genome sequencing centers. The code includes routines for retraining the system on data from other organisms. A version of the system trained on A. thaliana genes is currently under development. Total processing time to find all genes in malaria chromosome 2 (approximately one million nucleotides) is about 50 min on a Pentium 450 processor running Linux.

2.5.  Annotating  a  Genome 

In its current form, GlimmerM produces multiple gene models for some genes. When no database matches and no other computational evidence were found to support a GlimmerM prediction, the chromosome 2 annotation reflects the highest scoring model. Although many of these are likely to be correct, it is undoubtedly the case that some are not. Further investigation is required to confirm these predictions (but see below for laboratory evidence confirming a small subset).

The GlimmerM algorithm was used as one of a suite of tools. Accurate gene identification depends on using every tool available, and the description here should not be taken as implying that GlimmerM alone can find all genes in P. falciparum or any other genome. However, it was a central component in a larger strategy. Other important computational tools used by the malaria chromosome 2 team were as follows: (1) searches of a nonredundant protein sequence database using gapped BLAST and PSI-BLAST (Altschul et al., 1990, 1997); (2) gapped alignments of DNA to protein and EST sequence databases using DDS and DPS (Huang et al., 1997); (3) prediction of putative signal peptides using SignalP (Nielsen et al., 1997); (4) prediction of transmembrane domains with PHDhtm (Rost et al., 1995); (5) prediction of nonglobular structures with SEG (Wootton and Federhen, 1996); and (6) a graphical tool to allow annotators to view all the evidence together. In addition, the project used alignment tools developed at The Institute for Genomic Research to detect frameshift errors: these tools allow an annotator to detect when a sequence alignment extends beyond the start and stop codons indicated by other tools. In some cases this indicates errors in sequencing, which can be corrected; in other cases it indicates either a genuine frameshift that occurs during translation or a mutation that has changed the length of the translated protein. Any comprehensive annotation effort needs these computational tools and more to produce reasonably accurate gene annotations.

3.  RESULTS  AND  DISCUSSION 

GlimmerM was used as the primary gene finder for chromosome 2 of P. falciparum. Chromosome 2 has 209 protein-coding genes spread over approximately one million bases, for a gene density of one gene per 4.5 kb (1/4.5 kb). This contrasts with a density of 1/kb in bacteria, 1/2 kb in yeast, 1/7 kb in C. elegans, and 1/50 kb (estimated) in human. Of the 209 protein-coding genes, 43% had at least one intron, and those genes with introns usually had just one or two introns (Gardner et al., 1998). Below we attempt to quantify GlimmerM’s accuracy on these genes.

3.1.  Training 

To train the IMM, we needed to collect as much coding sequence as possible from P. falciparum itself. We exhaustively surveyed the literature to collect every complete sequence that was backed by laboratory evidence. Our survey collected 119 complete coding sequences from 108 GenBank entries representing all 14 chromosomes, of which just 6 genes came from chromosome 2. (This database is available by e-mail upon request from the authors.) Note that by length, chromosome 2 comprises approximately 3% of the genome, so it is unsurprising that just 6/119 genes were from chromosome 2. GenBank contains more than 108 entries from P. falciparum, but other entries do not have clear evidence supporting their splice sites. This training set provided the initial data for the splice site models as well.

TABLE 1

Performance of GlimmerM on Genes Whose Structure Is Completely Known from Independent Laboratory Evidence

Name      Len   Intr  Comment                                            Common name
PFB0100c  654   1     Perfect match                                      Knob-associated His-rich prt
PFB0295w  471   0     Perfect match                                      Adenylosuccinate lyase (00)
PFB0300c  272   0     Perfect match                                      Merozoite surface antigen MSP-2
PFB0305c  272   1     Perfect match                                      Merozoite surface antigen MSP-5 (EGF domain)
PFB0310c  272   1     Perfect match, highest score from 5 models         Merozoite surface antigen MSP-4 (EGF domain)
PFB0340c  997   3     Perfect match, second highest score from 4 models  SERA antigen/papain-like Protease with active Ser
PFB0405w  3135  0     Perfect match, higher score from 2 models          Transmission blocking Target antigen PfS230

Note. All seven genes had perfect matches to the system’s predictions, meaning that the start codon, stop codon, and every splice site were correctly predicted. The column headings give the gene name (Name), its length in amino acids (Len), number of introns (Intr), a comment on GlimmerM’s prediction, and the common name of the protein.

An important point to emphasize here is that P. falciparum has an unusually high 82% AT content. As a consequence of this high AT content, stop codons are very frequent (e.g., TAA will occur especially often) in noncoding DNA. This makes it much more likely that long open reading frames (ORFs) represent coding sequence. This fact was used to generate additional training data for GlimmerM: ORFs greater than 500 bp in the chromosome 2 sequence were assumed to be coding regions and were used in the IMM training. These were added to the list generated by the literature search.
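The argument that long ORFs in AT-rich DNA are unlikely to arise by chance can be made concrete with a rough calculation. The sketch below assumes a deliberately crude null model that is not part of the original analysis: bases are independent, A and T are equally frequent, and G and C are equally frequent. The function names are introduced here for illustration.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def base_probs(at_fraction):
    """Per-base probabilities under an independent-base model."""
    return {
        "A": at_fraction / 2,
        "T": at_fraction / 2,
        "G": (1 - at_fraction) / 2,
        "C": (1 - at_fraction) / 2,
    }


def stop_codon_prob(at_fraction):
    """Probability that a random codon is a stop codon."""
    p = base_probs(at_fraction)
    return sum(p[a] * p[b] * p[c] for a, b, c in STOP_CODONS)


def prob_open(n_codons, at_fraction):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - stop_codon_prob(at_fraction)) ** n_codons


for at in (0.50, 0.82):
    print(f"AT = {at:.2f}: P(stop) = {stop_codon_prob(at):.4f}, "
          f"P(167 codons open) = {prob_open(167, at):.2e}")
```

Under this model, about one random codon in ten is a stop at 82% AT, compared with 3 in 64 at 50% AT, so a chance ORF longer than 500 bp (about 167 codons) is far less probable in P. falciparum noncoding sequence than in a genome of balanced composition.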

3.2.  Accuracy  on  Known  Genes 

The 209 genes included in the chromosome 2 annotation were found with GlimmerM’s help. To evaluate the accuracy of the system, it is helpful to consider only those genes from this set for which independent evidence can be found to confirm their existence.

The best way to measure the program’s accuracy is to consider its accuracy on those proteins whose exon-intron structure is known precisely from laboratory studies. There are seven genes from chromosome 2 of P. falciparum that currently fit into this category; i.e., the sequence from start to stop has been completely characterized. Of these seven, six were included in the training set, and one (PFB0100c) was not.

GlimmerM’s performance on this small set of genes is shown in Table 1. For the two-exon gene PFB0100c, the only independently confirmed gene that was not included in the training set, the system predicted only one model: the correct one. For all seven of the genes, GlimmerM’s output contained a model that matched perfectly. For four of the genes, the correct model was the only one output by the system. For PFB0310c and PFB0405w, GlimmerM produced five and two competing models, respectively, but in each case the highest scoring one was correct. Only for PFB0340c, a four-exon gene, was GlimmerM’s correct model not the highest scoring one. The system gave a slightly higher score to a model that used a different donor site for the first exon. GlimmerM’s alternate prediction would have a 23-aa insertion in this 997-aa protein.


3.3.  Laboratory  Tests 

An ideal way of measuring the accuracy of GlimmerM precisely would be to test each of its predictions in the laboratory to see whether they are expressed as predicted. Although a complete test of all predictions would be difficult and time-consuming, one careful set of experiments was conducted as part of the chromosome 2 study.

Because many of the proteins predicted by GlimmerM had unusual nonglobular domains, the chromosome 2 project team ran a reverse transcriptase PCR (RT-PCR) experiment for 13 of these genes (Gardner et al., 1998) to determine whether or not they were real. These genes are shown in Table 2. The RT-PCR focused its attention on nonglobular domains, not entire proteins, so it could not confirm every detail of the GlimmerM predictions. In particular, it did not test the exon-intron boundaries for the two genes in this set that contain introns, because the nonglobular domains in those genes do not cross those boundaries. This experiment confirmed that all 13 of the nonglobular domains are expressed; i.e., the predictions for those regions were correct. To our knowledge, this is the first time ever that computational gene predictions provided the impetus for experiments that in turn confirmed the predictions.

TABLE 2

The Set of Genes with Nonglobular Domains for Which RT-PCR Experiments Were Conducted to Confirm Expression

Name      Length  Intr  Common name
PFB0130w  538     0     Prenyl transferase
PFB0145c  1979    0     Hypothetical protein
PFB0180w  560     1     prt with 5'-3' exonuclease domain
PFB0265c  1516    0     RAD2 endonuclease
PFB0380c  2010    0     Phosphatase (acid phosphatase family)
PFB0435c  1138    7     Predicted amine transporter
PFB0500c  235     0     RAB GTPase
PFB0520w  1233    0     Novel protein kinase
PFB0525w  610     0     Asparaginyl-tRNA synthetase
PFB0685c  885     0     ATP-dependent acyl-CoA synthetase
PFB0720c  899     0     Ori. recognition complex subunit 5 (ATPase)
PFB0755w  1398    0     Hypothetical protein
PFB0880w  426     0     FAD-dependent oxidoreductase

Note. Length is shown in amino acids, and Intr gives the number of introns. In the two genes containing introns, the nonglobular domains are contained within exons.

Eleven of these 13 genes have sequence homology to known proteins from other organisms. It is worth noting that the nonglobular domains of the P. falciparum proteins did not occur in the homologs. For example, PFB0180w contains a 176-amino-acid nonglobular insert that is absent from four homologous bacterial exonuclease domains (shown in Fig. 2 of Gardner et al., 1998). GlimmerM’s prediction for this gene was confirmed by amplifying and then sequencing a region that contained the nonglobular domain. This example points out that the presence of a homologous protein sequence does not always produce an accurate gene prediction.

3.4.  Comparison  on  Genes  with  Homologs 

Of the 209 genes in chromosome 2, 119 have homologous proteins in the public sequence databases. (The training set also contained 119 genes, but the identity of these two numbers is merely coincidence.) The existence of homologs, which come from a wide range of other organisms, provides strong independent evidence that these genes are real. We therefore used these genes to make further measurements of GlimmerM’s accuracy.

Of the 119 genes, 7 were already mentioned: these are the genes from chromosome 2 whose exon-intron structure was known from previously published laboratory studies. Six of those were included in the training set, which leaves 113 genes in chromosome 2 that were not included in the training set and for which we have good hints of their exon-intron structure. Because these are homologs, parts of some genes may not align well, making the predicted exon-intron structure less certain.

GlimmerM finds 98 of these 113 genes (87%) exactly; i.e., the positions of the start codon, the boundaries of each exon and intron, and the stop codon correspond to what is indicated by the alignments to homologous genes. Of these, 22 have competing gene models that score higher, meaning that a human annotator had to examine the output and decide, based on the alignment, to use a model other than the highest-scoring one.

Of the 15 genes that GlimmerM did not find exactly, 14 were found but had slightly modified coding regions. Seven intronless genes were predicted with incorrect start codons. Three 2-exon genes were broken into two genes each. Four 3-exon genes were predicted with an incorrect first exon but correct second and third exons.

Only one of the genes with homologs, ribosomal protein S30, was missed completely; ribosomal proteins often have a strikingly different composition from other genes and are known to be difficult for content-based gene finders to locate. These will not be missed as long as genomic data are searched against databases of known ribosomal proteins.

In summary, chromosome 2 contains 113 genes with homologs that were not included in the set of 119 genes used to train GlimmerM’s IMM. Portions of some of these genes, those with ORFs greater than 500 bp, were extracted automatically and added to the IMM; this portion of the training is fully automatic and requires no human intervention. The splice site training also included some data from chromosome 2, as explained above. A similar procedure can be performed on future chromosomes to extract additional splicing data: first use a sequence alignment program to find homologous genes, extract splice sites from those, and add those splice sites to the Markov chain models. This will allow users of the system to improve the system’s performance before making a final run on their chromosomes. Assuming this or a similar protocol is followed, the estimates given here should extrapolate reliably to those chromosomes. Of the 113 genes with homologs, GlimmerM is able to annotate automatically 76 (67%) if its top-scoring prediction is assumed correct. If a human annotator is available to confirm or reject predictions, then this number grows to 87% (98/113). In most cases the differences between competing models are small, involving one splice site or the start codon. Information from alignments or from other programs (for example, identification of signal peptides) allowed the human annotators to override GlimmerM’s first choice in selected cases.
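The counts reported for the genes with homologs can be cross-checked with a few lines of arithmetic. The snippet below only restates numbers given in this section; the variable names are introduced here for illustration.

```python
homolog_genes = 113          # genes with homologs, outside the training set
exact = 98                   # predicted exactly by some GlimmerM model
outscored = 22               # exact models that were not the top-scoring model
near_misses = {"wrong start codon": 7,
               "split into two genes": 3,
               "wrong first exon": 4}
missed = 1                   # ribosomal protein S30

top_scoring_exact = exact - outscored


def pct(part, whole):
    """Percentage rounded to the nearest integer."""
    return round(100 * part / whole)


print("exact with human review:", exact, f"({pct(exact, homolog_genes)}%)")
print("exact, fully automatic: ", top_scoring_exact,
      f"({pct(top_scoring_exact, homolog_genes)}%)")
print("accounted for:", exact + sum(near_misses.values()) + missed)
```

The exact matches, the 14 near misses, and the single missed gene together account for all 113 genes.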

3.5.  Comparison  to  Chromosome  2  Annotation 

Of the 209 genes currently annotated for chromosome 2, GlimmerM finds 178 exactly. Of these, 40 have competing gene models that score higher; in these cases the human annotators chose a model other than the top-scoring one for the final annotation. Of the remaining 31 genes, GlimmerM finds the stop codons correctly for 14. Different starts appear in the final annotation for several reasons, for example, the existence of a match to a protein sequence that starts at a different start codon. (Note that it is possible that GlimmerM is still correct in these cases.) The system finds the correct start but the wrong stop codon for 4 genes; this occurs in multiexon genes in which a splice site was missed and one of the exons was incorrectly extended until it hit a stop codon. The 11 remaining partial hits are cases for which GlimmerM predicts some but not all exons correctly; for example, several multiexon genes are each broken into two separate genes.

Only 2 of the 209 genes are missed completely. One is ribosomal protein S30, which was mentioned above. The second is a predicted integral membrane protein of 192 aa predicted by a preliminary version of GlimmerM (before retraining the splice site models). A separate program was used to predict the function of this protein; it did not align to any known sequences.
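The outcome categories reported for the full chromosome 2 annotation partition the 209 genes, which can be verified directly. The tally below uses only counts stated in this section; the dictionary keys are descriptive labels chosen here.

```python
annotated_genes = 209
outcomes = {
    "exact match": 178,
    "stop codon correct, start differs": 14,
    "start correct, stop codon wrong": 4,
    "some but not all exons correct": 11,
    "missed completely": 2,
}
outscored_exact = 40  # exact models that were not GlimmerM's top-scoring model

top_scoring_exact = outcomes["exact match"] - outscored_exact

print("total accounted for:", sum(outcomes.values()))
print(f"exact: {100 * outcomes['exact match'] / annotated_genes:.1f}%")
print(f"exact and top-scoring: {100 * top_scoring_exact / annotated_genes:.1f}%")
```

The five categories sum to 209, so every annotated gene falls into exactly one of them.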

The improved splice site Markov models resulted in GlimmerM’s generating 41 fewer gene models than before. In addition to the one missed gene just described, it generated 5 new gene models. Of these, one appears to encode a genuine protein, and we are currently investigating this to see if it should be added to the published annotation.

A significant caveat to include with these results is that GlimmerM often produces multiple competing models that the human annotator must resolve. Most genes with three or more exons result in multiple models. The system indicates which model scores the highest, but as indicated above, 40 of the “correct” gene models had alternative parses that scored higher. These alternative parses share some exons but use different splice sites for others. A human annotator looking at additional evidence, such as alignments to homologous proteins or predictions of signal peptides, was able to overrule the system’s top choice in these cases. It is likely that in other cases where no evidence besides GlimmerM’s prediction is available, some of the published annotation may still be in error (all such proteins are annotated as hypotheticals). After each set of multiple gene models was collapsed into one model, the gene list still contained 266 genes. (All of the models can be downloaded on the Web at www.tigr.org/~salzberg/GlimmerMchr2output.html.) This means that, since only 209 genes appeared in the final annotation, the annotators eliminated another 57 gene models entirely from the output. These decisions were somewhat subjective: frequently the putative genes were short or they consisted mostly of low-complexity sequence, and this was not enough to convince the human annotators that the genes were real. In many cases the annotators are probably correct, but it is simply impossible at this point to say with confidence that all of the deleted genes are false-positives. Only further evidence will allow us to decide, but this makes clear the importance of continuing to update and improve genome annotation over time.

Detailed restriction maps of microbial genomes are a valuable resource in genome sequencing studies but are toilsome to construct by contig construction of maps derived from cloned DNA. Analysis of genomic DNA enables large stretches of the genome to be mapped and circumvents library construction and associated cloning artifacts. We used pulsed-field gel electrophoresis-purified Plasmodium falciparum chromosome 2 DNA as the starting material for optical mapping, a system for making ordered restriction maps from ensembles of individual DNA molecules. DNA molecules were bound to derivatized glass surfaces, cleaved with NheI or BamHI, and imaged by digital fluorescence microscopy. Large pieces of the chromosome containing ordered DNA restriction fragments were mapped. Maps were assembled from 50 molecules producing an average contig depth of 15 molecules and high-resolution restriction maps covering the entire chromosome. Chromosome 2 was found to be 976 kb by optical mapping with NheI, and 946 kb with BamHI, which compares closely to the published size of 947 kb from large-scale sequencing. The maps were used to further verify assemblies from the plasmid library used for sequencing. Maps generated in silico from the sequence data were compared to the optical mapping data, and good correspondence was found. Such high-resolution restriction maps may become an indispensable resource for large-scale genome sequencing projects.


Optical mapping is a system for the construction of ordered restriction maps from single molecules (Schwartz et al. 1993; Anantharaman et al. 1997). Individual DNA molecules bound to derivatized glass surfaces and cleaved with restriction enzymes are imaged by digital fluorescence microscopy. Resulting cut sites are visualized as gaps between cleaved DNA fragments, which retain their original order (Cai et al. 1995, 1998). Optical mapping has been used to prepare maps of a number of large insert clone types such as bacterial artificial chromosomes (Cai et al. 1998) and most recently genomic DNA (J. Lin, R. Qi, C. Aston, J. Jing, T.S. Anantharaman, B. Mishra, D. White, J.C. Venter, and D.C. Schwartz, in prep.). A shotgun mapping strategy was developed in parallel for several microorganisms using large fragments of randomly sheared DNA that were mapped with high cutting efficiencies. The numerous overlapping restriction site landmarks and a measurable cutting efficiency combined to enable accurate contig assembly without the use of cloned DNA (Anantharaman et al. 1998). Because library construction was obviated, it was possible to map large Plasmodium falciparum (P. falciparum) DNA fragments, which are AT-rich and notoriously difficult to clone because of deletion and rearrangement in Escherichia coli (Gardner et al. 1998). Because cloning artifacts were precluded, this enabled accurate maps to be generated. Furthermore, small amounts of starting material were used, facilitating the mapping of this and potentially other parasites that are problematic to culture or clone.

Sequencing of chromosome 2 of P. falciparum was completed recently by Gardner and colleagues (Gardner et al. 1998), as part of an international consortium sequencing the whole P. falciparum genome (Foster and Thompson 1995; Dame et al. 1996). Existing physical maps of P. falciparum chromosomes [chromosome 3 (Thompson and Cowman 1997) and chromosome 4 (Sinnis and Wellems 1988; Watanabe and Inselberg 1994)], prepared by restriction digestion, gel fingerprinting, and hybridization of probes, are of moderate resolution and not ideally suited for systematic sequence verification. To assess the feasibility of optically mapping a whole eukaryotic chromosome, we constructed high-resolution, ordered restriction maps of P. falciparum chromosome 2 using genomic DNA and later compared these maps with those generated in silico from the sequence data. Such restriction maps reveal the architecture of large spans of the genome and have value in shotgun sequencing efforts because they provide ideal scaffolds for sequence assembly, finishing, and verification. Gaps that form between contigs can be characterized in terms of location and breadth, thereby facilitating closure techniques.

RESULTS 

P. falciparum Chromosome 2 DNA Sample
A chromosome 2 gel slice was used as starting material. Despite the AT-rich nature of the P. falciparum genome (80-85%), melting of low-gelling-temperature agarose inserts did not affect the integrity of the DNA and the chromosomal DNA was competent for optical mapping. Previously, we mounted DNA molecules by sandwiching the sample between an optical mapping surface and a microscope slide, followed by peeling the surface from the slide. DNA molecules were stretched and fixed onto the surface. This method works very well with clone types such as bacteriophage, cosmid, and BAC (Cai et al. 1995, 1998); however, larger genomic DNA molecules tend to form crossed molecules. We improved this approach by adding the sample to the edge formed by the placement of a surface onto a slide. The liquid DNA sample spreads into the space between the surface and the slide by capillary action. Consequently, DNA breakage was minimized, molecules tended to elongate in the same direction, and crossed molecules were also minimized (see Fig. 1).

NheI and BamHI Maps for P. falciparum Chromosome 2

The genomic DNA was mapped with either NheI (Fig. 1A) or BamHI (Fig. 1B). Fragment sizes were calculated by comparison with comounted λ bacteriophage DNA (48.5 kb). P. falciparum DNA has an AT content of 80-85% and λ bacteriophage DNA has an AT content of 50%. The YOYO-1 fluorochrome used for DNA staining intercalates preferentially between GC pairs with increased emission quantum yield (Netzel et al. 1995). A correction factor was therefore applied to each fragment size to correct for this large difference in fluorochrome incorporation. λ bacteriophage DNA was also used to determine areas on the surface where cutting efficiency was highest. Cutting efficiencies were >80%. Maps were obtained from individual molecules of ~350 kb. Consensus maps were assembled from 50 molecules generating an average contig depth of 15 molecules. Chromosome 2 was found to be 976 kb by optical mapping with NheI, and 946 kb by optical mapping with BamHI (average size 961 kb). There were 40 fragments in the NheI map, ranging from 1.5-115 kb, with average fragment size 24 kb (Fig. 2). There were 30 fragments in the BamHI map ranging from 0.5-80 kb, with average fragment size 32 kb (Fig. 2). Each fragment size in the consensus map was averaged from 10 to 15 fragments. Although P. falciparum chromosome 2 migrates as a distinct band by PFGE, we found the gel slice to contain only 60% chromosome 2-specific DNA. The remaining optical mapping data were rejected.
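The map statistics quoted above are mutually consistent, as a short calculation shows. The snippet uses only figures stated in the text (including the 947-kb sequence-derived size given in the abstract); the variable names are introduced here for illustration.

```python
sequence_size_kb = 947                       # published size from sequencing
optical = {
    "NheI": {"size_kb": 976, "fragments": 40},
    "BamHI": {"size_kb": 946, "fragments": 30},
}

mean_optical_kb = sum(m["size_kb"] for m in optical.values()) / len(optical)
print("mean optical size (kb):", mean_optical_kb)

for enzyme, m in optical.items():
    avg_fragment_kb = m["size_kb"] / m["fragments"]
    deviation_pct = 100 * (m["size_kb"] - sequence_size_kb) / sequence_size_kb
    print(f"{enzyme}: average fragment {avg_fragment_kb:.1f} kb, "
          f"deviation from sequence {deviation_pct:+.1f}%")
```

Dividing each map size by its fragment count reproduces the reported average fragment sizes of about 24 kb and 32 kb, and both map sizes lie within a few percent of the sequence-derived size.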

Figure 1 Typical P. falciparum chromosome 2 molecules and their corresponding optical maps. (A) Digested with NheI. (B) Digested with BamHI. Maps derived from the two BamHI-digested molecules in (B) can be aligned.

Figure 2 High-resolution optical mapping of P. falciparum chromosome 2 using NheI (A) and BamHI (B). The underlying contig used to generate the consensus map is shown. The map predicted from sequence information is shown for comparison.

Integration of Optical Maps and Sequence Data
The chromosome 2 sequence assembled by Gardner and colleagues shows chromosome 2 to be 947 kb (Gardner et al. 1998) versus 976 kb by optical mapping with NheI and 946 kb with BamHI. The optical restriction maps were compared to restriction maps predicted from the sequence, and there was very good correspondence between the two, indicating that there were no major rearrangements or errors in the assembled sequence (Table 1). The optical map included all fragments above 500 bp predicted from sequence. The overall agreement between these maps and the sequence was therefore excellent, with the average fragment size difference below 600 bp (relative error 4.3%) for the NheI map. The average fragment size difference for the BamHI map was 1.2 kb (relative error 5.8%). However, there were several notable differences. Large differences in size for the fragments at each end of the chromosome were noted (Tables 1 and 2). This is because the sequence for these subtelomeric regions is still under construction. PCR products spanning subtelomeric gaps are being sequenced currently. The optical map sizes were larger than those predicted from sequence for certain other fragments (Tables 1 and 2). These differences were due to large fluorescence intensity measurements falsely caused by crossed molecules. Currently, we combine length measurements with fluorescence intensity measurements to improve on our sizing of these fragments. Chromosome 2 maps using these new measurements show no exceptional errors (not shown; Jing et al., in prep.). The map was used to facilitate sequence verification. Optical maps can also be used at the earlier sequence-assembly stage to form a scaffold for assembly of contigs formed from sequencing. Linking of single-enzyme maps produces a much higher resolution multi-enzyme map that is rich in information. Smaller contigs can be placed confidently on a multi-enzyme map. Nowadays, mapping is rarely done in the absence of sequencing. Figure 3 shows a comparison of a multi-enzyme map generated by optical mapping with that predicted from sequence. The maps are in complete agreement across the whole length of the chromosome. Given even small amounts of sequence (~100 kb), maps can be linked and verified readily.

Map Confirmation by Southern Blotting

To confirm the optical maps independently of sequence data, pulsed-field
gels of total P. falciparum DNA digested with NheI or BamHI were run and
blotted. Plasmid clones used as sequencing templates provided the probes
to analyze the Southern blots. Restriction fragment sizes from the blots
agreed closely with those determined by optical mapping and those
predicted from the preliminary sequence. Probe PF2CM93 hybridized to a
7.5-kb band generated by NheI digestion and PFGE. The fragment size
predicted from sequence information was 7.6 kb. The corresponding
fragment size from the optical map was also 7.6 kb (Table 1). The same
probe hybridized to a 41-kb band generated by BamHI digestion and PFGE.
The fragment size predicted from sequence information was 41.3 kb. The
corresponding fragment size from the optical map was 40.8 kb (Table 2).
Probe PF2NA66 also generated data with fragment sizes that were very
similar (Tables 1 and 2). By using the same probe on DNA digested with
the two different enzymes, the optical maps were oriented and linked
with one another.

DISCUSSION 

We have generated high-resolution NheI and BamHI optical restriction
maps of P. falciparum chromosome 2, which were used in sequence
verification. Despite the fact that chromosome 2 is resolved easily by
PFGE, we found the chromosome 2 gel slices to contain only 60%
chromosome 2-specific DNA. The balance consisted of contaminating DNA
molecules from other chromosomes. Consequently, a portion of the optical
mapping data was rejected. Had we mapped other chromosomes using the
same strategy, we could not have expected to acquire clean data from
chromosomes that are less resolvable by PFGE, such as chromosomes 5-9.

Table 1. Comparison of NheI Optical Map with Restriction Map Predicted
from Sequence

Optical map   Predicted from    Difference   Relative         Hybridizing
(kb)          sequence (kb)     (kb)         difference (%)   probe
-----------   --------------    ----------   --------------   -----------
 71.8          66.597           5.24         -                -
114.5         115.147           0.63         0.6              -
 10.3          10.226           0.02         0.2              -
  3.4           3.359           0.07         2.1              -
  7.9           7.856           0.05         0.6              -
 24.7          23.684           1.03         4.4              -
  6.8           4.933           1.88         38.0             -
 16.5          14.553           1.97         13.6             -
  3.2           2.875           0.30         10.3             -
  -             0.177           -            -                -
 11.5          11.425           0.10         0.9              -
  4.1           3.768           0.30         7.9              -
 63.8          63.252           0.50         0.8              -
 10.0          10.018           0.01         0.1              -
  6.7           6.431           0.27         4.2              -
  8.9           9.248           0.31         3.3              -
 28.7          27.327           1.34         4.9              -
  4.3           4.357           0.07         1.6              -
  7.6           7.581           0.01         0.01             PF2CM93
 11.0          10.588           0.44         4.2              -
 60.5          60.324           0.21         0.4              -
 12.3          11.935           0.40         3.3              -
  4.1           3.964           0.12         3.0              -
 58.2          57.925           0.25         0.4              -
  5.5           5.381           0.07         1.3              -
  -             0.363           -            -                -
  1.6           1.546           0.02         1.5              -
 23.4          22.405           0.96         4.3              -
 35.1          34.171           0.91         2.6              -
 18.1          17.156           0.93         5.4              -
  3.1           2.947           0.16         5.4              -
 24.9          25.138           0.28         1.1              -
 40.8          40.107           0.73         1.8              -
 20.8          20.176           0.59         2.9              -
 25.1          24.476           0.62         2.5              -
 77.3          75.172           2.15         2.9              PF2NA66
 16.6          16.637           0.07         0.4              -
 48.0          45.683           2.30         5.0              -
  9.4           8.546           0.88         10.3             -
 20.1          18.986           1.15         6.0              -
 23.9          23.192           0.75         3.2              -
 32.1          14.897           5.65         -                -
-----------   --------------    ----------   --------------   -----------
976.5         934.513           0.60         4.3

To check the fidelity of the optical maps independently, Southern
blotting of chromosome 2 DNA was performed. Sequenced small-insert
clones were used as probes, enabling the optical maps to be
cross-checked against the sequence. In all, the optical maps were
verified against sequence data and Southern blot analysis, and were
found to be very accurate. A more directed approach would be to use
sequencing templates as hybridization probes to generate a series of
anchors for sequence assembly. Such templates would be placed precisely
onto the optical map in terms of physical distance (kb), and would be
critical for finishing genomic regions of high complexity, namely,
tandem or inverted repeats of high homology and short sequence length.
This approach would also readily assemble data acquired using different
techniques and would allow the placement of very short sequence contigs
onto a map. For example, STS markers or ESTs could be assigned to
restriction fragments on a whole-genome optical map.

Table 2. Comparison of BamHI Optical Map with Restriction Map Predicted
from Sequence

Optical map   Predicted from    Difference   Relative         Hybridizing
(kb)          sequence (kb)     (kb)         difference (%)   probe
-----------   --------------    ----------   --------------   -----------
 77.1          76.648           0.42         -                -
 19.9          20.955           1.07         5.09             -
  7.5           6.81            0.65         9.52             -
 26.1          27.054           0.95         3.52             -
  9.9           9.831           0.11         1.15             -
 41.0          43.295           2.28         5.26             -
 12.4          13.647           1.22         8.92             -
  3.7           3.754           0.02         0.67             -
 34.8          35.985           1.18         3.28             -
 21.1          20.22            0.91         4.51             -
 63.6          61.785           1.80         2.92             -
 55.9          55.217           0.73         1.32             -
 41.3          40.788           0.50         1.22             PF2CM93
 67.3          70.318           3.05         4.33             -
 46.7          46.943           0.23         0.49             -
 81.2          87.327           6.14         7.03             -
  2.0           1.786           0.20         11.35            -
  8.9          11.633           2.68         23.07            -
 18.6          17.953           0.69         3.85             -
 80.8          83.96            3.16         3.77             -
 19.9          20.665           0.78         3.76             -
 31.1          30.351           0.72         2.39             -
 17.4          17.959           0.56         3.10             -
 28.6          30.812           2.22         7.21             PF2NA66
 52.2          49.95            2.26         4.52             -
  2.0           1.813           0.18         9.70             -
 24.9          24.79            0.07         0.28             -
  6.0           5.315           0.65         12.28            -
  0.5           0.621           0.12         19.48            -
 34.8          16.346           6.93         -                -
-----------   --------------    ----------   --------------   -----------
937.2         934.331           1.25         5.86

Optical maps of entire chromosomes should also find utility at the
sequence-assembly stage, in which numerous large contigs are formed but
have unknown order along a chromosome. Traditional approaches to
establish contig order rely, in part, on combinatorial PCR, or sequence
alignment with physical landmarks, which are usually well defined in
terms of order but not physical distance. This is where optical maps can
streamline the final assembly process: they reduce the required number
of PCR reactions by providing an easily interpretable physical scaffold
with which sequence contigs can be aligned. The alignment process is
simply to generate restriction maps in silico from the sequence data and
compare these with the optical maps. When multiple enzymes are used
independently and the resulting maps are aligned properly, the composite
map decreases the size of the sequence contig necessary for confident
alignment to the final scaffold.
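The in silico step described above can be illustrated with a minimal Python sketch. It is not the software used in this work: the function names, the toy sequence, and the simple in-order pairing of fragments are assumptions made for illustration. The recognition sites used (NheI, G^CTAGC; BamHI, G^GATCC) are the standard ones for these enzymes.

```python
import re

# Recognition sequence and top-strand cut offset for each enzyme:
# NheI cuts G^CTAGC and BamHI cuts G^GATCC.
ENZYMES = {"NheI": ("GCTAGC", 1), "BamHI": ("GGATCC", 1)}


def in_silico_digest(sequence, enzyme):
    """Return the ordered fragment lengths (bp) of a linear sequence."""
    site, offset = ENZYMES[enzyme]
    # A lookahead pattern is used so that overlapping sites are all found.
    pattern = "(?=" + site + ")"
    cuts = [m.start() + offset for m in re.finditer(pattern, sequence.upper())]
    bounds = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def compare_maps(optical_kb, predicted_kb):
    """Pair fragments in order (hypothetical helper).

    Returns one (absolute difference in kb, relative difference in %)
    tuple per fragment pair, relative to the sequence-predicted size.
    """
    if len(optical_kb) != len(predicted_kb):
        raise ValueError("maps must contain the same number of fragments")
    return [(abs(o - p), 100.0 * abs(o - p) / p)
            for o, p in zip(optical_kb, predicted_kb)]


# Toy sequence with one NheI site and one BamHI site (illustrative only).
toy = "A" * 10 + "GCTAGC" + "T" * 20 + "GGATCC" + "A" * 5
nhe_fragments = in_silico_digest(toy, "NheI")
bam_fragments = in_silico_digest(toy, "BamHI")
print(nhe_fragments, bam_fragments)
```

A real comparison would also have to tolerate missing small fragments and partial digestion, which this in-order pairing does not attempt.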

The information content of a multiple restriction enzyme map is greater
than the sum of its parts (Lander and Waterman 1988). We used the
sequence data to align the NheI and BamHI restriction maps with respect
to each other, creating a composite map. We expected to find a number of
restriction site reversals in this composite. That is, given our sizing
errors, closely spaced fragments in the composite map may not be
represented in the correct order, and could shift relative position. To
our initial astonishment, we found only one instance of reversal. Given
this result, we decided to evaluate its statistical significance.

One way to evaluate the quality of a composite enzyme map is to examine
how well it preserves the order of the restriction sites. For instance,
if we create two maps, one with restriction enzyme A and the other with
restriction enzyme B, and combine the two maps in correct order, it is
still possible that the sizing error in the individual fragments creates
a situation in which a restriction site of type B appears before A,
whereas the correct order (in the sequence) is A followed by B; that is,
the restriction sites shift. Assume that both enzymes cut at the same
rate $E$, and the genome (or chromosome) length is $L$. Then the total
number of fragments of each type is $N = LE$. If the sizing error in a
fragment is $\sigma$ (for instance 1 kb), then the maximum sizing error
occurs in the middle of the map and is bounded by $(\sqrt{N}/2)\,\sigma$
(a rather conservative estimate).

Thus, a fragment of length $l$, with a cut of type A at one end and of
type B at the other end, may appear in the computed map as a fragment
whose length is a random variable with mean $l$ and standard deviation
$\sigma' \approx \sqrt{N}\,\sigma$. Thus the probability that this
fragment will appear in the reversed order is bounded by
$\Phi(l/\sigma')$, where

$$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-t^2/2}\,dt$$

is the upper-tail probability of the standard normal distribution.
Furthermore, the length of the fragment with cuts A and B is distributed
as $2E\,e^{-2El}$. Thus, a random fragment of this kind has a length
longer than $\sigma'$ with probability $e^{-2E\sigma'}$, and a simple
estimate shows that the probability of reversal is bounded by

$$(1 - e^{-2E\sigma'})\,\Phi(0) + e^{-2E\sigma'}\,\Phi(1).$$

Consider the following values of the parameters: $L = 980$ kb,
$E = 1/30{,}000$ (one cut per 30 kb), $\sigma = 1$ kb. For these values,
$\sigma' = 5.7$ kb and the average fragment length (with two enzymes) is
15 kb. The above estimate indicates that the probability of reversal is
bounded by 0.27. A somewhat better estimate can lower this value to
0.17. As the expected number of fragments with cuts A following B (or B
following A) is ~30, one would expect to see at most about five
reversals.
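The arithmetic behind the bound can be checked with a short Python sketch. It assumes that the function in the bound is the upper-tail probability of the standard normal distribution (an assumption that reproduces the quoted value of 0.27); the function names are illustrative and not taken from the original analysis.

```python
import math


def upper_tail(x):
    """Upper-tail probability of the standard normal, P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def reversal_bound(length_kb, cuts_per_kb, sigma_kb):
    """Return (N, sigma', bound) for the crude reversal estimate.

    N      = L * E, the number of fragments per enzyme
    sigma' = sqrt(N) * sigma, the worst-case positional error
    bound  = (1 - exp(-2 E sigma')) * Phi(0) + exp(-2 E sigma') * Phi(1)
    """
    n_fragments = length_kb * cuts_per_kb
    sigma_prime = math.sqrt(n_fragments) * sigma_kb
    p_long = math.exp(-2.0 * cuts_per_kb * sigma_prime)  # P(l > sigma')
    bound = (1.0 - p_long) * upper_tail(0.0) + p_long * upper_tail(1.0)
    return n_fragments, sigma_prime, bound


# Parameters from the text: L = 980 kb, one cut per 30 kb, sigma = 1 kb.
n, sigma_prime, p_bound = reversal_bound(980.0, 1.0 / 30.0, 1.0)
print(f"N = {n:.1f}, sigma' = {sigma_prime:.1f} kb, bound = {p_bound:.2f}")
print(f"expected reversals at p = 0.17: {30 * 0.17:.1f}")
```

The refined value of 0.17 is taken directly from the text; its derivation is not given, so the sketch does not attempt to reproduce it.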

Figure 3 The use of sequence information to link single-enzyme maps. The
top map was generated by normalizing the single-enzyme maps to be the
same size (961 kb). The resulting multi-enzyme map was aligned with the
map predicted from sequence. The median relative error is 7%. The
average absolute error is 1.4 kb. Upper tick marks are NheI sites; lower
tick marks are BamHI sites.

However, the composite map created by optical mapping has only one
reversal. The probability of this situation (at most one reversal)
occurring is ~1 in 40. More exactly, this probability is
$(1 - p)^{30} + 30p\,(1 - p)^{29} = 0.023$. This difference may signal
the requirement for a more sophisticated analysis, or may indicate the
presence of a potentially useful physical effect. A closer examination
of the data reveals that the error in the fragment sizes in the
composite map has a normal distribution with mean 0.02 kb and standard
deviation 2.01 kb. Surprisingly, the error in the cut locations has a
mean of -1.78 kb and a standard deviation of 1.82 kb, indicative of the
presence of systematic (e.g., sequence-specific) error and much smaller
unsystematic error. A recalculation of the expected number of reversals
with the observed value ($\sigma' = 1.82$ kb) results in slightly more
than two reversals, making the observed number of reversals of only one
much more likely (~1 in 7 as opposed to 1 in 40). Note that as our
estimate of $\sigma'$ is for the worst-case situation, we believe a more
realistic analysis may close the gap. On the other hand, this may be
caused by another biochemical effect that we do not account for in our
analysis. More experiments and analyses are required to resolve this
situation.
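The probability of observing at most one reversal is a binomial tail, which the following Python sketch evaluates. It assumes 30 independent A/B junctions and the per-junction reversal probability p = 0.17 quoted earlier; the helper name is illustrative. With these inputs the result is of the same order as the roughly 1-in-40 figure given in the text; the small difference presumably reflects rounding of p.

```python
from math import comb


def prob_at_most(k, n, p):
    """Binomial probability of observing at most k events in n trials."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


# 30 junctions between A and B sites, reversal probability 0.17 each.
p_one_or_fewer = prob_at_most(1, 30, 0.17)
print(f"P(at most one reversal) = {p_one_or_fewer:.3f}")
```

The independence of junctions is itself a simplification, since neighboring fragments share sizing errors.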

Current optical mapping studies of P. falciparum use whole genomic DNA
as starting material. The chromosomes are resolved at the level of data
rather than as physical entities. The data segregate into 14 deep
contigs corresponding to the various chromosomes. Chromosome 2 can be
resolved based on size and the near-complete correspondence with the
data shown in this paper (one 600-bp BamHI fragment is missing on the
whole-genome map). The success of this project has prompted the Malaria
Genome Consortium to recommend support of whole-genome mapping to assist
in closure of chromosomes, as well as for verification of the final
assembly.

In summary, we describe the construction of an ordered restriction map
of P. falciparum chromosome 2 using optical mapping of genomic DNA. A
combined approach using shotgun sequencing and optical mapping will
facilitate sequence assembly and finishing of large and complex genomes.

METHODS 

Parasite  Preparation 

P. falciparum (clone 3D7) was cultivated using standard techniques
(Trager and Jensen 1976). To minimize possible alterations of the genome
that can occur in continuous culture (Corcoran et al. 1986), parasite
aliquots were kept frozen in liquid N2 until needed and then cultivated
only as long as necessary. Parasites were cultivated to late
trophozoite/early schizont stages and enriched on a Plasmagel gradient.
The parasitized red blood cells were washed once with several volumes of
10 mM Tris (pH 8), 0.85% NaCl, and the parasites were freed from the
erythrocytes by incubation in ice-cold 0.5% acetic acid in dH2O for
5 min, followed by several washes in cold buffer. The parasites were
resuspended to a concentration of 2 × 10^9/ml in buffer and maintained
in a 50°C water bath. An equal volume of 1% InCert agarose (FMC,
Rockland, ME) in buffer, prewarmed to 50°C, was mixed with the prewarmed
parasites, and the mixture was added to a 1 × 1 × 10-cm gel mold,
plugged at one end with solidified agarose, and allowed to cool to 4°C.
The agarose-embedded parasites were pushed out of the mold, incubated
with 50 ml of proteinase K solution (2 mg/ml proteinase K in 1%
Sarkosyl, 0.5 M EDTA) at 50°C for 48 hr with one change of proteinase K
solution, and stored in 50 mM EDTA at 4°C (Schwartz and Cantor 1984).

Chromosome  2  Isolation  by  PFGE 

Uniform parasite slices were taken with a glass coverslip using two
offset microscope slides as guides. One half to one quarter of a single
slice was sufficient per lane. Parasite slices were arranged end to end
on the flat side of the gel comb. The parasites were fixed to the comb
by a small bead of molten (60°C) agarose. The comb was then placed into
the gel mold and molten agarose [1.2% SeaPlaque (FMC) in 0.5× TBE]
poured around the parasite-containing slices. Once cooled, the comb was
removed and the space filled with molten agarose. A CHEF DRIII apparatus
(Bio-Rad, Hercules, CA) was used for all PFGE (Schwartz and Cantor 1984)
chromosome separations. Gels were run with 180-250 sec of ramped pulse
time at 3.7 V/cm and 120° field angle, for 90 hr at 14°C with
recirculating buffer at ~1 l/min, using Saccharomyces cerevisiae and/or
Hansenula wingei PFGE size markers (Bio-Rad). To minimize UV damage to
the DNA, gel slices were removed from the ends of the gel, stained with
ethidium bromide (5 µg/ml), and visualized by long-wave (320 nm) UV
light. Notches corresponding to the individual chromosomes were made in
the agarose gel and used as guides to cut the chromosome from the gel.
The chromosome-containing gel slices were stored in 50 mM EDTA at 4°C
until needed. The gel was stained with ethidium bromide to verify the
chromosome excision. The genome of P. falciparum is 26-30 Mb in size,
consisting of 14 chromosomes ranging in size from 0.6 to 3.5 Mb (Foote
and Kemp 1989). PFGE resolves most of the P. falciparum chromosomes,
except 5-9, which are of similar size and comigrate. The gel band
containing P. falciparum chromosome 2 was resolved easily, cut from the
gel, melted at 72°C for 7 min, and incubated with agarase at 40°C for
2 hr. The melted agarose band was diluted in TE to a final DNA
concentration suitable for optical mapping (~20 pg/µl).

Mounting and Digestion of DNA on Optical Mapping Surface

Optical mapping surfaces were prepared as described previously (Aston et
al. 1999). Briefly, glass coverslips (18 × 18 mm; Fisher Finest,
Pittsburgh, PA) were cleaned by boiling in concentrated nitric, then
hydrochloric acid. Surfaces were derivatized with
3-aminopropyldiethoxymethyl silane (APDEMS; Aldrich Chemical, Milwaukee,
WI). One surface was placed onto a microscope slide. A DNA sample
(10 µl) was added to the edge between the surface and the slide and
spread into the space between the surface and the slide. The surface was
then peeled off from the slide. Digestion was performed by adding 100 µl
of digestion solution [50 mM NaCl, 10 mM Tris-HCl (pH 7.9), 10 mM MgCl2,
0.02% Triton X-100, 20 units of restriction endonuclease; New England
Biolabs, Beverly, MA] onto the surface and incubating at 37°C for 15 to
30 min. The buffer was aspirated and the surface washed with water
before staining of DNA with YOYO-1 homodimer (Molecular Probes, Eugene,
OR), prior to fluorescence microscopy. Comounted λ bacteriophage DNA
(New England Biolabs) was used as a sizing standard and also to estimate
cutting efficiencies.

Image  Acquisition,  Processing,  and  Map  Construction 

DNA molecules were imaged by digital fluorescence microscopy. The
optical mapping surface was scanned by the operator for individual
digested DNA molecules of adequate length and quality to be collected
for image processing and map making. Images were collected with a cooled
charge-coupled device (CCD) camera (Princeton Instruments, Trenton, NJ)
using Optical Map Maker (OMM) software, as described previously (Jing et
al. 1998). Images of DNA fragments were processed using a modified
version of NIH Image (Huff 1996), which integrates fluorescence
intensity for each fragment. These values were used to assemble an
ordered restriction map for each molecule. Fluorescence intensity of
λ bacteriophage DNA standards was used to measure the size of the
P. falciparum restriction fragments on a per-image basis. Cutting
efficiencies (on a per-image basis) were determined from scoring cut
sites on sizing standard molecules contained in the same field as the
genomic DNA molecules. Standard molecules were cut once by NheI and five
times by BamHI. The map for the entire chromosome 2 was manually
assembled into contigs by aligning overlapping regions of congruent cut
sites. If there were no overlapping regions, the molecules were
considered to be from a contaminating P. falciparum chromosome and were
discarded. Consensus maps for chromosome 2 were assembled by averaging
the fragment sizes from the individual maps underlying the contigs.
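The sizing and averaging steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the OMM or NIH Image code: the function names are hypothetical, each λ standard is assumed to be represented by its total integrated intensity, the λ genome is taken as approximately 48.5 kb, and aligned molecule maps are assumed to be given as equal-length lists with None where a molecule does not cover a fragment.

```python
from statistics import mean

LAMBDA_KB = 48.5  # approximate size of the bacteriophage lambda genome


def size_fragments(fragment_intensities, lambda_intensities):
    """Convert integrated fluorescence intensities to kb.

    The scale factor comes from lambda standards in the same image, so
    sizing is done on a per-image basis.
    """
    kb_per_unit = LAMBDA_KB / mean(lambda_intensities)
    return [intensity * kb_per_unit for intensity in fragment_intensities]


def cutting_efficiency(observed_cuts, n_standards, sites_per_standard):
    """Fraction of expected cut sites actually observed on the standards."""
    return observed_cuts / (n_standards * sites_per_standard)


def consensus_map(aligned_maps):
    """Average each fragment over the molecules that cover it."""
    consensus = []
    for column in zip(*aligned_maps):
        sizes = [size for size in column if size is not None]
        if not sizes:
            raise ValueError("a consensus fragment has no supporting molecule")
        consensus.append(mean(sizes))
    return consensus


# Illustrative values only (not measured data).
sizes_kb = size_fragments([200.0, 50.0], [100.0, 100.0])
efficiency = cutting_efficiency(45, 10, 5)  # e.g. 10 standards, 5 BamHI sites
consensus = consensus_map([
    [10.0, 20.0, None],
    [12.0, None, 30.0],
    [None, 22.0, 34.0],
])
print(sizes_kb, efficiency, consensus)
```

In this work each consensus fragment was averaged from 10 to 15 underlying fragments; the three-molecule example is kept small only for readability.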

Southern  Blotting  of  P.  falciparum  Genomic  DNA 

P. falciparum genomic DNA (10 µg) was digested with NheI or BamHI,
resolved by PFGE (POE apparatus, 1% gel in 0.5× TBE, pulse time, 1 sec,
2 sec; switch time, 12 sec, 150 V, for 24 hr) (Schwartz and Koval 1989),
blotted, and hybridized with probes derived from small-insert clones
used for sequencing (PF2CM93 and PF2NA66). Probes were labeled by random
priming.
The unicellular parasite Plasmodium falciparum is the cause of human
malaria, resulting in 1.7-2.5 million deaths each year¹. To develop new
means to treat or prevent malaria, the Malaria Genome Consortium was
formed to sequence and annotate the entire 24.6-Mb genome². The plan,
already underway, is to sequence libraries created from chromosomal DNA
separated by pulsed-field gel electrophoresis (PFGE). The AT-rich genome
of P. falciparum presents problems in terms of reliable library
construction and the relative paucity of dense physical markers or
extensive genetic resources. To deal with these problems, we reasoned
that a high-resolution, ordered restriction map covering the entire
genome could serve as a scaffold for the alignment and verification of
sequence contigs developed by members of the consortium. Thus optical
mapping was advanced to use simply extracted, unfractionated genomic DNA
as its principal substrate. Ordered restriction maps (BamHI and NheI)
derived from single molecules were assembled into 14 deep contigs
corresponding to the molecular karyotype determined by PFGE (ref. 3).

Optical mapping is now a proven means for the construction of accurate,
ordered restriction maps from ensembles of individual DNA molecules
derived from a variety of clone types, including bacterial artificial
chromosomes⁴ (BACs), yeast artificial chromosomes⁵ (YACs) and
small-insert clones⁶. We previously developed approaches for mapping
clone DNA samples that relied on the analysis of large numbers of
identical DNA molecules. Here, the challenge was to develop ways to
generate restriction maps of a population of randomly sheared DNA
molecules directly extracted from cells that were obviously
non-identical. Problems to be solved included the development of
techniques for mounting very large DNA molecules onto surfaces and new
methods for accurately mapping individual molecules, which were uniquely
represented within a population. Finally, new algorithms were necessary
to assemble such maps into gap-free contigs covering all 14 chromosomes
of the P. falciparum genome.

We developed an optical mapping approach, termed shotgun optical
mapping, that used large (250-3,000 kb), randomly sheared genomic DNA
molecules as the substrate for map construction (Fig. 1a-e). Random
fragmentation of genomic DNA occurred naturally as a consequence of
careful pipetting and other manipulations. Surface-mounted molecules
were digested using BamHI and NheI (refs 6-8). Because genomic DNA
molecules frequently extended through multiple digital image fields, we
developed an automated image acquisition system (GenCol) to overlap
digital images with proper registration (Figs 1c and 2). Map
construction techniques were altered to take into account local
restriction endonuclease efficiencies (the rate of partial digestion)
and the analysis of molecule populations that differed in composition
and mass. These steps were necessary to enable accurate construction of
map contigs.

Fig. 1 Schematic of shotgun optical mapping approach. a, Shotgun optical
mapping used large (250-3,000 kb), randomly sheared genomic DNA
molecules as the substrate for map construction. b, Random fragmentation
of genomic DNA occurred naturally as a consequence of careful pipetting
and other manipulations. Surface-mounted molecules were digested using
BamHI and NheI (ref. 8). c, Because genomic DNA molecules frequently
extended through multiple image fields, an automated image acquisition
system was developed (GenCol) and used to overlap images with proper
registration. d, Map construction techniques take into account local
restriction endonuclease efficiencies (the rate of partial digestion)
and the analysis of molecule populations that differed in composition
and mass. e, These steps were necessary to enable accurate construction
of map contigs.

¹W.M. Keck Laboratory for Biomolecular Imaging, Department of Chemistry
and ²Courant Institute of Mathematical Sciences, Department of Computer
Science, New York University, Department of Chemistry, New York, New
York, USA. ³Malaria Program, Naval Medical Research Center and ⁴The
Institute for Genomic Research, Rockville, Maryland, USA. ⁵Present
address: University of Wisconsin-Madison, Departments of Chemistry and
Genetics, UW Biotechnology Center, Madison, Wisconsin, USA.
Correspondence should be addressed to D.C.S. (e-mail:
schwadOWmcrcr.med.nyu.edu).

Fig. 2 Digital fluorescence micrograph and map of a typical genomic DNA
molecule. A P. falciparum molecule digested with NheI is shown with its
corresponding optical map. Comparison with the consensus optical map
shows this molecule to be an intact chromosome 3. Image composed by
tiling a series of 63× (objective power) images using GenCol. Co-mounted
λ bacteriophage DNA is used as a sizing standard and to estimate cutting
efficiencies.

Previous map construction techniques using cloned DNA molecules⁵,⁶,⁹
determined restriction-fragment mass on the basis of relative measures
of integrated fluorescence intensities or apparent lengths. Thus,
fragment masses were reported as a fraction of the total clone size
(1.0), and later converted to kilobases by independent measure of clone
masses (that is, cloning vector sequence¹⁰). Additionally, maps derived
from ensembles of identical molecules were averaged to construct final
maps. In contrast, here, we independently sized restriction fragments in
genomic shotgun optical mapping using λ bacteriophage DNA that was
co-mounted and digested in parallel (Fig. 2). These molecules were also
used to locally monitor the restriction digestion efficiency, and to
infer the extent of digestion on a per-molecule (genomic) basis. Cutting
efficiencies were in excess of 80%. This assessment provided a critical
set of parameters for the contig assembly program, 'Gentig'⁸,¹¹,¹², to
reliably overlap maps derived from individual DNA molecules.

Gentig assembled maps into a number of deep contigs, but did not assign
every single-molecule map to a contig. The program assembled contigs
using 50% of the available molecules, which corresponded to 70% of the
total mass of the molecules. In other words, the program was better able
to construct contigs from the longer single-molecule maps. Finishing
work using spreadsheets assembled the data into 14 contigs corresponding
to the PFGE-generated molecular karyotype, with a total genome size of
24.16 Mb (Table 1). BamHI and NheI maps had an average fragment size of
30.6 kb and 30.1 kb, respectively. We constructed consensus maps
(Fig. 3) by simple averaging of aligned restriction-fragment masses
(typically 6-26 fragments) derived from overlapping DNA molecules.
Overall, chromosome sizes were largely consistent with PFGE results,
with the total optical genome size being approximately 7% smaller,
indicating that no previously uncharacterized nuclear component was
found.
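As a worked check of these figures, the short Python sketch below estimates map coverage from the molecule counts and mean lengths reported in the Fig. 3 legend (944 NheI molecules averaging 588 kb; 1,116 BamHI molecules averaging 666 kb), and the shortfall of the optical genome size relative to the PFGE total of 26.05 Mb in Table 1. The function names are illustrative assumptions.

```python
def coverage(n_molecules, mean_length_kb, genome_size_mb):
    """Fold coverage: total mapped DNA divided by genome size."""
    total_mb = n_molecules * mean_length_kb / 1000.0
    return total_mb / genome_size_mb


def relative_shortfall(reference_mb, measured_mb):
    """Fraction by which the measured size falls below the reference."""
    return (reference_mb - measured_mb) / reference_mb


OPTICAL_GENOME_MB = 24.16  # total genome size from the optical map
PFGE_GENOME_MB = 26.05     # sum of PFGE chromosome sizes (Table 1)

nhe_coverage = coverage(944, 588, OPTICAL_GENOME_MB)
bam_coverage = coverage(1116, 666, OPTICAL_GENOME_MB)
shortfall = relative_shortfall(PFGE_GENOME_MB, OPTICAL_GENOME_MB)

print(f"NheI coverage:  {nhe_coverage:.1f}x")
print(f"BamHI coverage: {bam_coverage:.1f}x")
print(f"Optical map vs. PFGE: {100 * shortfall:.1f}% smaller")
```

These values round to the 23x and 31x coverage and the approximately 7% size difference quoted in the text and figure legend.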

We previously constructed a high-resolution optical map of P. falciparum
chromosome 2 (ref. 7). The starting material was a PFGE gel slice
containing fractionated chromosome 2 DNA. We have now constructed a
whole-genome optical map using total, unfractionated genomic DNA as the
starting material and resolved all 14 chromosomes, including
electrophoretically inseparable ones (chromosomes 5-9, termed the
'blob'), at the level of data (optical map contigs) rather than as
physical entities (that is, gel bands).


Table 1 • P. falciparum whole-genome optical mapping

Chr.   PFGE (Mb)    NheI (Mb)  BamHI (Mb)  Ave. (Mb)  Diff. (Mb)  Linkage/confirmation  Orientation
1      0.65/0.65*   0.684      0.668       0.676      0.016       1,3                   +
2      1.0/0.947*   0.958      1.037       0.997      0.079       1,3                   +
3      1.2/1.060*   1.084      1.096       1.090      0.012       1,2                   +
4      1.4          1.311      1.306       1.309      0.005       1
5      1.6          1.331      1.337       1.334      0.006
6      1.6          1.395      1.373       1.384      0.022
7      1.7          1.494      1.444       1.469      0.050
8      1.7          1.495      1.504       1.499      0.009
9      1.8          1.600      1.595       1.598      0.005
10     2.1          1.808      1.688       1.748      0.120       1
11     2.3          2.097      2.089       2.093      0.008       1
12     2.4          2.478      2.361       2.419      0.117       1
13     3.2          3.172      3.022       3.097      0.150       2                     +
14     3.4          3.436      3.404       3.420      0.032       1,3                   +
Total  26.05        24.341     23.974      24.157     0.367

*Size from sequencing. Linkage/confirmation was obtained as follows: by mapping PFGE-purified chromosomal material (1); by mapping chromosome-specific
YACs (2); or by sequence information (3). +, BamHI and NheI maps have been oriented. Chr., chromosome; Ave., average size; Diff., difference between BamHI
and NheI maps.
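
As a rough arithmetic check on Table 1, the sketch below re-derives the per-chromosome averages and the shortfall of the optical genome size relative to PFGE. The lists are transcribed from the table as printed; the function name `summarize` is an illustrative assumption, and summing the rows reproduces the printed totals only to within a few hundredths of a megabase.

```python
# Per-chromosome sizes (Mb) transcribed from Table 1, chromosomes 1-14.
PFGE = [0.65, 1.0, 1.2, 1.4, 1.6, 1.6, 1.7, 1.7, 1.8, 2.1, 2.3, 2.4, 3.2, 3.4]
NHEI = [0.684, 0.958, 1.084, 1.311, 1.331, 1.395, 1.494,
        1.495, 1.600, 1.808, 2.097, 2.478, 3.172, 3.436]
BAMHI = [0.668, 1.037, 1.096, 1.306, 1.337, 1.373, 1.444,
         1.504, 1.595, 1.688, 2.089, 2.361, 3.022, 3.404]


def summarize(pfge, nhei, bamhi):
    """Return (PFGE total, optical total, percent by which optical is smaller)."""
    # The 'Ave.' column is the mean of the two enzyme maps for each chromosome.
    averages = [(n + b) / 2 for n, b in zip(nhei, bamhi)]
    pfge_total = sum(pfge)
    optical_total = sum(averages)
    shortfall_pct = 100 * (pfge_total - optical_total) / pfge_total
    return pfge_total, optical_total, shortfall_pct


pfge_total, optical_total, shortfall_pct = summarize(PFGE, NHEI, BAMHI)
print(f"PFGE {pfge_total:.2f} Mb, optical {optical_total:.2f} Mb, "
      f"optical smaller by {shortfall_pct:.1f}%")
```

The result is consistent with the "approximately 7% smaller" figure quoted in the text.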
Fig. 3 High-resolution optical mapping
of the P. falciparum genome
using NheI and BamHI. We
mapped 944 molecules with NheI;
the average molecule length was
588 kb, corresponding to 23x coverage.
We mapped 1,116 molecules
with BamHI; the average
molecule length was 666 kb, corresponding
to 31x coverage. a, Gap-free,
consensus NheI and BamHI
maps were generated across all 14
P. falciparum chromosomes using
the map contig assembly program
Gentig. b,c, NheI and BamHI map
alignments determined by Gentig,
displayed by ConVEx. Fragment
sizes of consensus maps (blue lines)
shown in (a) were determined
from the alignment and averaging
of maps derived from 6-26 underlying
individual molecules (green
lines), 230-2,716 kb. d, Enlargement
of contig for chromosome 3
(NheI) shown in ConVEx displays
maps (green) scaled to the consensus
map (blue). These data can be
accessed at http://carbon.biotech.wisc.edu/plasmodium.
Bar lengths
reflect measured fragment sizes.
Fragments that overlap are shaded.
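
The coverage figures in the caption follow from simple arithmetic: total mapped DNA divided by genome size. The sketch below assumes the optical genome size of 24.16 Mb from Table 1 as the denominator (the caption does not state which size was used), and `fold_coverage` is an illustrative name.

```python
def fold_coverage(n_molecules, mean_length_kb, genome_size_mb):
    """Fold coverage = total mapped DNA (Mb) / genome size (Mb)."""
    total_mapped_mb = n_molecules * mean_length_kb / 1000.0
    return total_mapped_mb / genome_size_mb


GENOME_MB = 24.16  # assumption: optical genome size from Table 1

nhei_cov = fold_coverage(944, 588, GENOME_MB)
bamhi_cov = fold_coverage(1116, 666, GENOME_MB)
print(f"NheI: {nhei_cov:.1f}x, BamHI: {bamhi_cov:.1f}x")
```

Rounded to whole numbers, these agree with the 23x and 31x stated in the caption.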
To assess errors produced by shotgun optical mapping, we
compared optical restriction maps for chromosome 2 with
restriction maps generated in silico using a previously assembled
sequence (ref. 13). We found good correspondence between the two
maps. The sequence shows chromosome 2 to be 947 kb versus
958 kb by optical mapping with NheI and 1,037 kb with BamHI.
Only one 600-bp BamHI fragment was missing in the entire
genome optical map. The NheI optical map included all fragments
above 400 bp predicted from sequence. The average
absolute relative error in sizing fragments was 4.6% for NheI and
5.0% for BamHI. Likewise, similar errors for chromosome 3 were
determined by comparing optical maps with sequence data
(NheI, 4.4%; BamHI, 4.1%; total optical size versus sequence,
NheI, 1,084 kb; BamHI, 1,147 kb; versus 1,060 kb; D. Lawson,
pers. comm.). These sizing errors were similar to those associated
with PFGE.
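
The comparison described above can be sketched in two steps: digest a sequence in silico, then compute the average absolute relative error between paired optical and predicted fragment sizes. This is a minimal illustration, not the authors' software. The function names are assumptions, and the fragments are assumed to be already paired one-to-one, which in practice requires an alignment step. The recognition sites used (NheI, G^CTAGC; BamHI, G^GATCC) are the standard ones for these enzymes.

```python
# Recognition site and cut offset within the site (both enzymes cut after the first G).
SITES = {"NheI": ("GCTAGC", 1), "BamHI": ("GGATCC", 1)}


def in_silico_digest(sequence, enzyme):
    """Return ordered fragment lengths (bp) for a linear sequence cut by one enzyme."""
    site, offset = SITES[enzyme]
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


def mean_abs_relative_error(optical, reference):
    """Average of |optical - reference| / reference, as a percentage."""
    if len(optical) != len(reference) or not reference:
        raise ValueError("fragment lists must be non-empty and paired one-to-one")
    total = sum(abs(o - r) / r for o, r in zip(optical, reference))
    return 100 * total / len(reference)


toy = "AAAGGATCCTTTTGCTAGCAA"  # made-up sequence with one site for each enzyme
print(in_silico_digest(toy, "BamHI"), in_silico_digest(toy, "NheI"))
print(mean_abs_relative_error([10.5, 19.0], [10.0, 20.0]))
```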

Some large NheI and BamHI fragments were noticeable at the
telomeric ends. A telomere of one of the ‘blob’ chromosomes
(chromosome 7) is composed of three consecutive 6-kb BamHI
fragments. Optical mapping can estimate numbers of repetitive
regions if the repeats contain recognition sites for the endonuclease
used. Subtelomeric regions in P. falciparum, however, are
characterized by 21-bp tandem repeats (ref. 14), which are too small to
be detected by optical mapping.

We used several approaches to verify and to link our optical
maps with the PFGE molecular karyotypes, which number chromosomes
according to mobility. Chromosomes that were identified
and the orientations of BamHI and NheI maps are shown
(Table 1). We confirmed chromosomal identities of some optical
maps by optical mapping of PFGE-purified chromosomal DNA
(ref. 7) with NheI or BamHI. Here, most maps formed a contig,
which aligned with a specific consensus map. Although
the largest and smallest P. falciparum chromosomes are resolved
by PFGE, the gel slices contained DNA molecules from other
chromosomes. There was, however, a sufficiently large population
of molecules that formed a contig with a particular chromosome
(>50%) to identify it as the chromosome predicted
from the molecular karyotype. When many chromosomes
are similar in size, such as chromosomes 5-9, there are many possible
orientations of the maps, so this approach was not viable.
Chromosome-specific YAC clones were also optically mapped for
further confirmation of chromosomal orientation and linkage.
We aligned the resulting maps with a specific contig in the consensus
maps (Fig. 4). YAC clones were not available for those chromosomes
in the ‘blob’, so we were unable to identify or link these
optical maps. As such, we have assigned numbers to these chromosomes
according to their optically determined masses
(Table 1). Maps can also be linked together by a series of double
digestions, by the use of available sequence information or by
Southern blot using chromosome-specific probes.

Fig. 4 Identification of chromosomes
and alignment of NheI and BamHI
maps by mapping chromosome-specific
YAC clones. Chromosome 3- and
13-specific YAC maps were aligned
with the optical maps and the two
enzyme maps were then oriented and
linked. Each YAC is ~150 kb.

Because unicellular parasites have relatively small chromosomes
that do not visibly condense, PFGE has provided a means
by which chromosomal entities can be physically mapped and
studied at the molecular level (refs 15-17). In fact, PFGE separations are
currently providing the very material that the international
malaria consortium is using to create chromosome-specific
libraries for large-scale sequencing efforts
(http://www-ermm.cbcu.cam.ac.uk/dcn/txt001dcn.htm). Unfortunately, parasites
such as P. falciparum can have karyotypically complex genomes,
which confound PFGE analysis by displaying similarly sized
chromosomes. Furthermore, very large or circular chromosomes
are difficult to physically identify or characterize (ref. 18). Although the
shotgun sequencing of entire microorganism genomes (refs 19,20) has
obviated physical mapping to some extent, high-quality, finished
sequence remains laborious to generate.

Many issues regarding the efficient sequencing of lower eukaryotes
remain to be fully resolved, especially when available map
resources are minimal. In the case of Saccharomyces cerevisiae, the
entire genome was sequenced by a large consortium of laboratories
on a per chromosome basis (ref. 21). Their tasks were facilitated by
the availability of extensive physical and genetic maps, plus an
assortment of well-characterized libraries. These substantial
genome resources provided ample means for the needed sequence
verification efforts, and aids for the sequence-assembly process. In
a similar, though much less distributive fashion, the Caenorhabditis
elegans genome was recently completely sequenced (ref. 22). Given the
rapid pace of electrophoretic sequencing technology (refs 23,24) and the
accumulation of resources in sequence acquisition and analysis,
new ways to efficiently sequence lower eukaryotes, particularly
those implicated in human disease, must be developed to optimally
leverage map resources created by optical mapping.

The optical maps presented here have been used by members
of the consortium (refs 13,25) as scaffolds to verify and facilitate
sequence assemblies. In general, the maps were integrated into
the sequence assembly process in much the same way as any
other physical maps. In particular, our maps have provided reliable
landmarks for sequence assembly where traditional maps
are somewhat sparse. Compared with sequence-tagged site
(STS) or EST maps, in which landmark order is known but
physical distance is approximate, optical restriction maps are
constructed from landmarks (restriction sites) that are precisely
characterized by physical distance. Another advantage is the
speed of map construction: the maps presented here required
only 4-6 months to generate. Given these and other advantages,
future work will center on the algorithmic integration of
high-resolution optical maps with primary sequence reads to more
fully automate the sequence assembly and verification process.
Finally, we plan to use optical mapping as the basis for developing
new ways to study genomic variations that fall between, or
outside of, the capabilities of sequence-based approaches and
cytogenetic observation.
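
One way to picture the scaffold role described above is to slide the in silico fragment sizes of a sequence contig along an optical map and keep the best-matching position. The sketch below is a hypothetical illustration, not the consortium's method. The function name, the 10% tolerance and the toy data are assumptions, and it expects only complete (interior) fragments of the contig, since the end fragments of a contig are truncated.

```python
def best_placement(optical_kb, contig_kb, tolerance=0.10):
    """Return (start_index, mean_relative_error) of the best window, or None.

    optical_kb: ordered fragment sizes of one optical consensus map.
    contig_kb:  ordered in silico fragment sizes of a sequence contig.
    A window is accepted only if every fragment agrees within `tolerance`.
    """
    if not contig_kb:
        raise ValueError("contig_kb must contain at least one fragment")
    span = len(contig_kb)
    best = None
    for start in range(len(optical_kb) - span + 1):
        window = optical_kb[start:start + span]
        errors = [abs(o - c) / c for o, c in zip(window, contig_kb)]
        if max(errors) <= tolerance:
            score = sum(errors) / span
            if best is None or score < best[1]:
                best = (start, score)
    return best


optical = [30.1, 12.4, 45.0, 8.2, 27.5, 61.0]  # made-up optical map (kb)
contig = [44.0, 8.0, 28.0]                     # made-up contig fragments (kb)
print(best_placement(optical, contig))
```

A contig whose fragments match no window returns `None`, which is the error-checking signal: either the assembly or the map is wrong at that location.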

Methods 

Parasite preparation. We cultivated P. falciparum (clone 3D7) in erythrocytes
using standard techniques (ref. 26). Possible alterations of the genome that
can occur in continuous culture (ref. 27) were minimized by keeping parasite
aliquots frozen in liquid N2 until needed. We then cultivated parasites only as
long as necessary and prepared agarose-embedded parasites as described (ref. 7).

Mounting and digestion of DNA on optical mapping surfaces. We prepared
derivatized glass optical mapping surfaces as described (refs 7,28). We diluted
genomic DNA in TE buffer containing a sizing standard (λ bacteriophage
DNA, 50 ng/ml), which was co-mounted with the genomic DNA by
spreading the sample into the space between the surface and a microscope
slide. DNA molecules were digested with NheI or BamHI (ref. 8). λ bacteriophage
DNA (48.5 kb; New England Biolabs) is cut once by NheI. λ
DASH II bacteriophage DNA (41.9 kb; Stratagene) is cut twice by BamHI.
Therefore, we also used standards to identify regions on the surface where
the digestion efficiency exceeded 70%. We stained DNA with YOYO-1
homodimer (Molecular Probes) before fluorescence microscopy. P. falciparum
DNA has an AT content of 80-85%, and λ bacteriophage DNA has
an AT content of 50%. The YOYO-1 fluorochrome used for DNA staining
preferentially intercalates between GC pairs with increased emission quantum
yield (ref. 29). We therefore applied a correction factor to each fragment size
to correct for this variation in fluorochrome incorporation.

Image acquisition, processing and map construction. We collected digital
images of DNA molecules with a cooled charge coupled device (CCD) camera
(Princeton Instruments) using Optical Map Maker (OMM) software as
described (ref. 6). Because genomic DNA molecules span multiple microscope
image fields, we developed ‘GenCol’, image acquisition and management
software that was used to automatically collect and overlap consecutive
CCD images with proper pixel registration. GenCol used a precise fitting
routine, and the resulting ‘super-images’ covered the entire length of single
DNA molecules, spanning several microscope fields. Restriction fragments
were marked up with ‘Visionade’ (ref. 28), a semi-automatic visualization/editing
program, which was run on super-images. Files created from marked-up
images of molecules were then sent to map construction software, which
automatically determined the restriction fragment masses, characterized
internal DNA standard molecules and produced finished maps from single
genomic molecules. The integrated fluorescence intensities of λ bacteriophage
DNA standards, co-mounted with the genomic molecules, were used
to measure the size of the P. falciparum restriction fragments on a per image
basis. Cutting efficiencies (on a per image basis) were determined from
scoring cut sites on sizing standard molecules contained in the same field as
the genomic DNA molecules. Knowledge of endonuclease cutting efficiencies
was critical for accurate contig construction.
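
The per-image sizing and cutting-efficiency calculations described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the map construction software itself. The function names are hypothetical, intensity is assumed to scale linearly with DNA mass, and `dye_correction` is a placeholder for the YOYO-1 correction factor, whose value is not given in the text.

```python
LAMBDA_KB = 48.5  # size of the lambda bacteriophage standard stated in the text


def kb_per_intensity(standard_intensities, standard_kb=LAMBDA_KB):
    """Scale factor (kb per intensity unit) from standards in the same image."""
    mean_intensity = sum(standard_intensities) / len(standard_intensities)
    return standard_kb / mean_intensity


def fragment_sizes_kb(fragment_intensities, scale, dye_correction=1.0):
    """Convert integrated fluorescence intensities to sizes in kb.

    dye_correction is a hypothetical placeholder for the AT-content correction.
    """
    return [i * scale * dye_correction for i in fragment_intensities]


def cutting_efficiency(observed_cuts, standards_scored, sites_per_standard):
    """Fraction of expected sites on the standards that were actually cut."""
    return observed_cuts / (standards_scored * sites_per_standard)


scale = kb_per_intensity([9700.0, 9700.0])            # made-up intensities
print(fragment_sizes_kb([2000.0, 6000.0], scale))     # sizes in kb
print(cutting_efficiency(36, 50, sites_per_standard=1))
```

With one NheI site per lambda standard, 36 cut molecules out of 50 scored gives an efficiency of 0.72, which would pass the 70% threshold mentioned in the text.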

Contig assembly by Gentig. Sophisticated statistical methods are used to
overcome errors associated with partial digestion and mass determination
(refs 11,12). Gentig finds overlapped molecules and assembles them into contigs.
It computes contigs of genomic maps using a heuristic algorithm for
finding the best scoring set of contigs (overlapping maps), because finding
the optimal placement is in general computationally too expensive. The
entire P. falciparum genome data set can be assembled into contigs in ~20
min. Gentig assembled consensus maps for each chromosome by averaging
the fragment sizes from the individual maps underlying the contigs.
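
The final averaging step can be illustrated with a short sketch. It assumes the alignment has already been done, so that each single-molecule map is a row with one slot per consensus fragment and `None` where the molecule does not cover that fragment. Gentig's statistical scoring and overlap search are not reproduced here, and `consensus_map` is an illustrative name.

```python
def consensus_map(aligned_maps):
    """Average aligned fragment sizes column by column, skipping uncovered slots."""
    n_fragments = len(aligned_maps[0])
    consensus = []
    for j in range(n_fragments):
        observed = [m[j] for m in aligned_maps if m[j] is not None]
        if not observed:
            raise ValueError(f"no molecule covers fragment {j}")
        consensus.append(sum(observed) / len(observed))
    return consensus


# Three made-up molecules (kb) overlapping across three consensus fragments.
molecules = [
    [30.0, 12.0, None],
    [32.0, 10.0, 45.0],
    [None, 11.0, 47.0],
]
print(consensus_map(molecules))
```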

Contig viewing and editing by ‘ConVEx’. We viewed contigs using
‘ConVEx’ (contig visualization and exploration tool). ConVEx is a
multi-scale zoomable interface for visualization and exploration of large,
high-resolution contiged restriction maps. Users can examine the consensus
maps together with the raw uncorrected data. ConVEx also has a
‘lens’ mechanism that provides annotation and editing features, allowing
communication of features such as STS markers, and even the
underlying sequence reads.

Chromosome isolation by PFGE. The genome of P. falciparum is ~25 Mb,
consisting of 14 chromosomes ranging from 0.6 to 3.5 Mb (ref. 28). PFGE
resolves most of the P. falciparum chromosomes, except 5-9, which are of
similar sizes and co-migrate. PFGE-purified chromosomal DNA was prepared
as described (ref. 8) and used as a substrate for optical mapping.

YAC isolation and mapping. We cultured yeast cells in AHC media and
prepared agarose-embedded cells using standard methods. We purified
YAC DNA using PFGE (POE apparatus, 1% gel in 0.5x TBE, pulse time 3 s,
5 s; switch time 32 s; 150 volts for 24 h; ref. 30). Optical maps of YAC clones
were prepared with NheI and BamHI as described above.

ICOS Receives $15M Milestone But Also Has Two Failed Trials
BioCryst Brings In $50.5M Through Public Offering


By Mary Welch
Staff Writer

ICOS Corp. received a $15 million milestone for starting
a Phase III study of its treatment of erectile dysfunction, but
the company also said that two other drugs in Phase II trials
failed to achieve statistical significance in their endpoints.

ICOS, of Bothell, Wash., and partner Eli Lilly & Co., of
Indianapolis, will take IC351 into Phase III trials to test the
phosphodiesterase type 5 (PDE5) inhibitor as an oral treatment
for sexual dysfunction in men and women.

“We have quite a number of compounds for a number
of indications in the clinic,” said Cary Peterman, ICOS’
senior director of therapeutic development. “We understand
that not all of them would be a success in all indications.
We saw the result of that today. But we are confident
of our long-term strategy and very pleased with what’s
happening with IC351 and the $15 million milestone.

See ICOS, Page 4


Consortium Takes On Plasmodium Falciparum

Sequencing of Malarial Parasite
Genome Gets Cutting-Edge Boost
From Optical Mapping Technique

By David N. Leff
Science Editor

When a lone serial killer goes to ground, law-enforcement
elements organize a search party made up of local
cops from the crime scene, county sheriffs, state constabulary,
U.S. marshals and the FBI. Once they hunt down and
catch the suspect, forensic DNA tests help nail down the
culprit’s identity, and evidence linking him to his victims.

Now a multinational posse of genomicists is hot on the
trail of the world’s largest-scale killer, most of whose victims
are children. Their quarry is the mosquito-borne parasite,
Plasmodium falciparum, which takes the lives of some
2 million people a year, most of them in the world’s tropical
areas.

See Genome, Page 5


By Lisa Seachrist
Washington Editor

With competition increasing in the market for flu
drugs, small-molecule specialist BioCryst Pharmaceuticals
Inc. raised $50.5 million in a public offering to fund development
of its flu pill and other clinical and preclinical programs.

The Birmingham, Ala.-based company sold 2 million
shares at a price of $25.25 each, exceeding its expected
offering revenues of $45.5 million based on a share price
of $25. The money will be used to advance its lead flu drug,
RWJ-270201, formerly known as BCX-1812, as well as development
programs in T-cell related diseases and purine
nucleoside phosphorylase (PNP) inhibitors.

The underwriters for the offering - Salomon Smith Barney
Inc., of New York; Hambrecht & Quist LLC, of San Francisco;
and Raymond James & Associates Inc., of New York -
were granted an option to purchase an additional 300,000

See BioCryst, Page 3


Versicor Raises $40 Million
To Develop Lead Compounds

By Mary Welch
Staff Writer

Versicor Inc. closed a $40 million round of private
equity financing aimed at funding its trials of LY30336, an
antifungal class of drugs, and BI 397, its second-generation
glycopeptide.

“This is significant funding,” said George Horner, president
and CEO of the Fremont, Calif.-based company. “We
wanted to raise $36 million, so this gives us a lot of flexibility.
It will take us through until we are able to go public.”

Horner declined to speculate when the company,
which spun off from Sepracor Inc. in 1995, would enter the
public market.

Horner also refused to say how many shares of stock
exchanged hands or what percentage of the company’s
equity was involved. “We’re a private company and I just

See Versicor, Page 3


Bill To Restore Patent Term Sees Movement In Senate

Exelixis,  Pharmacia  &  Upjohn  Expand  Alliance  Into  New  Area 

While other task forces have been striving for decades
to develop anti-malarial drugs and vaccines, the U.S.-UK
Malaria Genome Consortium came into existence a few
years ago to sequence P. falciparum’s entire 24.6
megabase genome.

“The consortium,” said genomicist David Schwartz, at
the University of Wisconsin-Madison, “is funded by the U.S.
National Institutes of Health, and Britain’s Wellcome Trust
plus the Burroughs-Wellcome Fund. The major sequencing
efforts,” he added, “include labs from Stanford University;
TIGR (The Institute for Genomic Research, in Rockville,
Md.); and the UK’s Sanger Institute. Those people have
divvied up the parasite’s 14 chromosomes,” Schwartz
observed.

“The consortium’s goals,” he pointed out, “are first completely
analyzing the genome, in terms of getting its
sequence. That should take about a year from now, maybe
a little bit longer. Then, secondly, making sense of that
sequence.”

Schwartz himself is making some of that sense at a
pre-sequencing stage of the consortium’s progress. He is
senior author of an article in the November 1999 issue of
Nature Genetics, titled, “A shotgun optical map of the entire
Plasmodium falciparum genome.”

“Optical mapping,” Schwartz explained, “is a new technology,
which colleagues and I invented and patented 10 or
11 years ago. It maps an organism’s entire genome from single
DNA molecules, provides reliable landmarks, and could
ratchet up the race to decipher complete genomes - from
food crops to human beings.

“One can picture optical mapping as an entire map of
the United States,” he suggested, “whereas conventional
genome sequencing would be thousands of detailed maps
of every city in the nation. Optical mapping data works in
concert with high-resolution DNA sequence data, linking
both together in a complete and seamless description of a
genome.”

Consortium laboratories, Schwartz told BioWorld
Today, “are already incorporating our optical scanning of P.
falciparum into their total-sequencing modus operandi.
They are relying on us to do the optical mapping, and they
do the sequencing.”

Parasite’s Map - From Bottle To Data

The 15 co-authors of his Nature Genetics report reflect
the assembly-line procedure of optically mapping the parasite’s
genome. “At the U.S. Naval Medical Research Center’s
malaria program,” Schwartz recounted, “Daniel Carucci and
his colleagues were able to grow the single-cell parasite,
put it into a bottle, then simply extract DNA from these bottled
pathogens.

“Then he sent the DNA to us - just long strands, millions
of bases in length. We simply pinned them down
on glass surfaces, to which they stick through electrostatic
forces. It’s very much like rubbing a balloon on a
wool sweater; then you’re able to stick it to a wall. Basically,
the DNA molecules stick to our glass surface, and
elongate.

“Next,” Schwartz went on, “we took two restriction
enzymes, and cut the DNA strands. Wherever an enzyme
recognizes its cognate site, it cuts the molecule. And we
can see that it cuts because that’s where a gap forms - visible
enough so that we can see it through a light microscope.
We have software that automatically goes in there
and finds the cleaved fragments, and measures their mass,
their size, according to how much fluorescence is associated
with each fragment.

“When you have an ordered restriction map,” he went
on, “a single molecule that’s been cut, it generates a series
of daughter fragments, which constitute a single ordered
restriction map. Then software puts together many such
maps that have overlaps of commonality with one another,
at least in part - and that did it. We then had a physical map
of an entire P. falciparum genome, generated without
clones or PCR or electrophoresis.”

For purposes of genome sequencing, Schwartz pointed
out, “these maps serve as a scaffold to tell you very concisely
how to align small snippets of sequence with a
whole chromosome. Also, to know if your sequence is correct,
this is a way of error-checking it.”

Anti-Malarial Drugs, Vaccines, Therapies?

“The fact that optical mapping can facilitate
sequencing,” Schwartz pointed out, “and be sure about
it, provides the stuff that people are going to be looking
at to develop new anti-malarial therapies, new vaccines,
new drugs and so on. By comparing maps of hundreds
of individual human genomes, for example,” he
added, “scientists could pinpoint the origin of genetic
diseases, understand the complexities of trait inheritance,
examine the process behind DNA repair. This is
like the Periodic Table.”

Having wrapped up P. falciparum - which took them
five months - he and his co-authors are now tackling the
genome of Trypanosoma brucei, the pathogen of
African sleeping sickness. And he has just received a
grant to take on the genome of rice, the world’s No. 1
food crop.

“What we’ve been in the process of doing for the
past three or four years,” Schwartz said, “is trying to
harden the optical mapping system so that it will have a
very very high throughput, and be very cheap to do.
Right now, we’re looking for industrial partners to do
that, because this sort of development work doesn’t go
that well in a university environment. Originally we got
a lot of funding from Chiron and Novartis. So now,” he
concluded, “we’re thinking about trying to put together
our own company.”


To subscribe, call BioWorld Customer Service at (800) 688-2421; outside the U.S. and Canada, call (404) 262-5476.
Copyright © 1999 American Health Consultants. Reproduction is strictly prohibited. Visit our web site at www.bioworld.com.


Parassitologia  41:  69-75,  1999 


The  malaria  genome  sequencing  project: 

complete  sequence  of  Plasmodium  falciparum  chromosome  2 

M.J. Gardner1, H. Tettelin1, D.J. Carucci2, L.M. Cummings1, H.O. Smith1,
C.M. Fraser1, J.C. Venter1, S.L. Hoffman2

1 The Institute for Genomic Research; 2 Malaria Program, Naval Medical Research Center; Rockville, MD, USA.

Abstract. An international consortium has been formed to sequence the entire genome of the human
malaria parasite Plasmodium falciparum. We sequenced chromosome 2 of clone 3D7 using a shotgun
sequencing strategy. Chromosome 2 is 947 kb in length, has a base composition of 80.2% A+T, and contains
210 predicted genes. In comparison to the Saccharomyces cerevisiae genome, chromosome 2 has
a lower gene density, a greater proportion of genes containing introns, and nearly twice as many proteins
containing predicted non-globular domains. A group of putative surface proteins was identified, rifins,
which are encoded by a gene family comprising up to 7% of the protein-encoding genes in the genome.
The rifins exhibit considerable sequence diversity and may play an important role in antigenic variation.
Sixteen genes encoded on chromosome 2 showed signs of a plastid or mitochondrial origin, including
several genes involved in fatty acid biosynthesis. Completion of the chromosome 2 sequence demonstrated
that the A+T-rich genome of P. falciparum can be sequenced by the shotgun approach. Within 2-3
years, the sequence of almost all P. falciparum genes will have been determined, paving the way for
genetic, biochemical, and immunological research aimed at developing new drugs and vaccines against
malaria.

Key words: Plasmodium falciparum, malaria, chromosome 2, rifins, genomics, malaria genome sequencing
project.


In 1995, the first complete genome sequence of a
free-living organism, Haemophilus influenzae, was
published (Fleischmann et al., 1995). The publication
of the H. influenzae genome sequence marked
a turning point in biology. As noted by Bloom, it
heralded a post-genomics era of microbe biology
when the complete genomes of most human
pathogens would have been sequenced, providing a
vast database of sequence information that would
enable researchers to focus on studies of the biology
and pathogenicity of these organisms (Bloom,
1995). This research in turn would lead to the development
of new drugs and vaccines to treat and
prevent diseases caused by these pathogens, and
would be especially useful for research on organisms
difficult to grow. Since then, there has been a
flurry of effort to sequence the genomes of other
pathogens, and the genomes of organisms that cause
diseases such as syphilis (Treponema pallidum), ulcers
(Helicobacter pylori), Lyme disease (Borrelia
burgdorferi), tuberculosis (Mycobacterium tuberculosis),
and trachoma (Chlamydia trachomatis) have
been completed (Fraser et al., 1997, 1998; Tomb et
al., 1997; Cole et al., 1998; Stephens et al., 1998).


Invited contribution to the Malariology Centenary Conference
“The malaria challenge after one hundred years of malariology”
held in Rome at the Accademia Nazionale dei Lincei,
16-19 November 1998.

Correspondence: Dr Malcolm J. Gardner, The Institute for
Genomic Research, 9712 Medical Center Drive, Rockville,
MD 20850, USA, Tel ++1 301 8383519, Fax ++1 301
8380208, e-mail: gardner@tigr.org


The genomes of several microbes of environmental
importance have also been sequenced, as has the
genome of the yeast Saccharomyces cerevisiae (see
<http://www.tigr.org/tdb/mdb/mdb.html> for a
complete listing of microbial genomes that have
been sequenced or that are in progress). There is no
evidence, so far, that the pace of sequencing has
slackened, and sequencing of more than 60 microbial
genomes is currently underway.

The completion of the first few microbial genomes
caused several groups to contemplate the sequencing
of the Plasmodium falciparum genome. It was
realized that determination of the complete P. falciparum
genome sequence would be of great value to
malariologists given the difficulty of studying this organism
in the laboratory, with large parts of the life
cycle being difficult, expensive, or impossible to
maintain in the laboratory. Furthermore, techniques
such as DNA microarrays and transfection had been
developed, providing researchers with new tools to
study the expression and function of genes and gene
products in malaria parasites. Several groups initiated
pilot sequencing projects, and an international
consortium including malaria researchers, genome
laboratories, bioinformatics centers, and funding
agencies was formed to coordinate the project, facilitate
collaboration, and ensure that the data would
be provided to the scientific community in a timely
and useful manner (Hoffman et al., 1997). The consortium
met every 6 months during the start-up
phase of the project and continues to meet regularly
as the work proceeds.

At  the  time  the  P.  falciparum  project  was  started, 
several prokaryotic and archaeal genomes had been
finished, and sequencing of the genomes of yeast
and Caenorhabditis elegans was nearing completion.
Two strategies had been used in these projects.
The clone-by-clone method, used to sequence the
Escherichia coli and S. cerevisiae genomes, for example,
involved sequencing of large-insert clones
from cosmid, lambda, and YAC libraries (Blattner et
al., 1997). The clones sequenced were selected after
the  construction  of  a  physical  map,  which  provided 
a  tiling  path  of  overlapping  clones  spanning  the 
genome.  The  other  method,  pioneered  at  TIGR,  was 
the  whole  genome  shotgun  method,  which  used  a 
genomic library of sheared 1-2 kb fragments prepared
in a plasmid vector (Fleischmann et al.,
1995). Thousands of randomly selected small-insert
clones were picked and sequenced, and custom fragment
assembly software was used to assemble the
overlapping fragments into a contiguous sequence.
This method proved to be very efficient in that construction
of a physical map was not required prior to
sequencing. However, very robust software for fragment
assembly had to be developed that was able to
handle many thousands of individual sequence reads
and also deal with the repetitive sequences present
in bacterial genomes. In addition, relational databases
and software were developed to manage the gap
closure, finishing, and annotation processes.

Sequencing of the P. falciparum genome raised
some formidable technical challenges, however. At
~28 Mb, the P. falciparum genome was almost 20-fold
larger than the H. influenzae genome and
seemed too large to tackle by the whole genome
shotgun method because of the computational requirements
of the assembly process. Closure of the
many gaps that would have remained after the initial
assembly would also have been difficult with such a
large genome and few sequence markers to guide the
closure process. On the other hand, the clone-by-clone
approach was ruled out because large-insert
(>20 kb) genomic libraries of very AT-rich P. falciparum
DNA in plasmid, lambda, and cosmid vectors
that could be used for sequencing were not available.
Although large-insert yeast artificial chromosome
(YAC) libraries of P. falciparum (Foster and
Thompson, 1995) that appeared to be stable had been
constructed, YACs are not very well suited to
high-throughput sequencing projects. Consequently,
a new approach was adopted in which individual
chromosomes were resolved on pulsed-field gels and
used to prepare chromosome-specific shotgun libraries
in plasmid and M13 vectors. Randomly selected
clones were then sequenced and assembled in
the same way as for a whole-genome shotgun project.
Some laboratories also performed low-coverage
sequencing of shotgun libraries prepared from YACs
previously mapped on the chromosomes (Foster and
Thompson, 1995); the YAC shotgun sequences
helped to group sequences from the same part of the
chromosome and assisted in gap closure. Adoption
of the chromosome-by-chromosome shotgun strategy
allowed the sequencing effort to be distributed
among the different sequencing centers.

Three groups are involved in the sequencing effort:
TIGR and the Malaria Program of the US Naval
Medical Research Center (NMRC); the Sanger Centre
in the UK; and Stanford University. The current
status of the project (as of July 1999) is summarized
in Table 1. Once the problems that had been encountered
in library construction, sequencing, assembly,
and gap closure were solved, all 3 groups began
to make rapid progress. The complete sequence of
chromosome 2 (0.95 Mb) was recently published by
the TIGR/NMRC group (Gardner et al., 1998), and
the Sanger Centre has virtually finished chromosome
3 (1.1 Mb). Work on the other chromosomes is well
under way. The chromosome 2 sequence was submitted
to GenBank, and the sequence and annotation are
available at TIGR’s web site and at the NCBI (Table
1). Preliminary unedited sequence data are also available
for downloading, browsing, or searching on web
sites maintained at each laboratory.


Table 1. Chromosome assignments and current status of the Malaria Genome Sequencing Project.
a Estimated chromosome sizes for P. falciparum clone 3D7 were taken from Dame et al. (1996) or from the sequence data.
b NIAID, National Institute for Allergy and Infectious Diseases; DoD, US Department of Defense; BWF, Burroughs Wellcome Fund.
c Complete annotation (chromosome 2) or preliminary data can be viewed at web sites maintained by the sequencing centers: TIGR/NMRC <http://www.tigr.org/tdb/mdb/pfdb/pfdb.html>; the Sanger Centre <http://www.sanger.ac.uk/Projects/P_falciparum/>; Stanford University <http://baggage.stanford.edu/group/malaria/start.html>.

Chromosome(s) a   Size (Mb)   Laboratory            Funding b        Status (as of 7/99) c
1                 0.8         Sanger Centre         Wellcome Trust   Closure
2                 0.95        TIGR/NMRC             NIAID, DoD       Completed (Gardner et al., 1998)
3                 1.1         Sanger Centre         Wellcome Trust   Completed (Bowman et al., in press)
4                 1.4         Sanger Centre         Wellcome Trust   Closure
5-8               1.6         Sanger Centre         Wellcome Trust   Sequencing
9                 1.8         Sanger Centre         Wellcome Trust   Sequencing
10                2.1         TIGR/NMRC             NIAID, DoD       Sequencing
11                2.3         TIGR/NMRC             NIAID, DoD       Closure
12                2.5         Stanford University   BWF              Closure
13                3.2         Sanger Centre         Wellcome Trust   Sequencing
14                3.4         TIGR/NMRC             BWF, DoD         Closure
Sequencing of the first P. falciparum chromosome

At  the  beginning  of  the  Malaria  Genome  Sequencing 
Project, P. falciparum clone 3D7 was chosen for sequencing
because it can complete all stages of the
life cycle, was used in a genetic cross (Walliker et
al., 1987), and had been used in the Wellcome Trust
Malaria Genome Mapping Project (Foster and
Thompson, 1995). The TIGR/NMRC group began a
pilot project to sequence chromosome 2, which was
selected because it could be easily resolved on
pulsed-field gels and, being about 1 Mb in size, was
not so large as to present insurmountable difficulties
in assembly or gap closure. P. falciparum chromosomes
were resolved on preparative pulsed-field gels
and the chromosome 2 bands from several gels were
cut out, adjusted to 0.3 M sodium acetate to prevent
melting of the AT-rich DNA, and digested with
agarase. The DNA was sheared by nebulization and
a shotgun library was prepared in pUC18 as described
(Fleischmann et al., 1995) except that treatment
with E. coli DNA polymerase I was performed
after the second ligation step to close nicks prior to
electroporation. During all steps of the library construction
process, the exposure of the DNA to UV
light was minimized to avoid damage to the DNA
that would reduce the cloning efficiency, particularly
of the very AT-rich intergenic sequences. In addition,
to avoid introducing non-randomness, the
library was not amplified prior to sequencing.
Rather, the ligation mixtures were stored at -20°C,
and as needed aliquots were electroporated into
DH10B cells and spread on ampicillin diffusion
plates. The shotgun library contained 1 × 10^5 recombinants
and had an average insert size of 1.6 kb.

Initial sequencing was done with the dye-primer
chemistry used previously to sequence H. influenzae
and the other microbial genomes. However, when
sequencing the P. falciparum clones we observed an
apparent artifact with the dye-primer chemistry that
caused runs of G base calls to be made incorrectly
following long runs of AT-rich sequence.
The artifact did not occur when FS+ dye-terminator
chemistry was used on the same template
DNAs, and the dye-terminator chemistry also produced
significantly longer sequence reads than the
dye-primer chemistry. Therefore the rest of the random-phase
sequencing was performed using the dye-terminator
chemistry. Over 23,000 individual sequences
were collected, which was equivalent to
about 10x coverage of the chromosome. This is
greater coverage than is normally obtained in a shotgun
project, but the excess coverage was thought to be
necessary to compensate for the presence of non-chromosome 2
DNA in the library arising from the
pulsed-field gel purification of the DNA, and for the
expected non-randomness of the shotgun library due
to the AT-rich inserts.
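
The coverage figure quoted above can be checked with simple arithmetic. The following Python sketch is an illustration only and is not part of the original analysis. It derives the mean read length implied by 23,000 reads at about 10x coverage of the 947,103 bp chromosome, and then applies the idealized Lander-Waterman estimate of the number of gaps expected if reads were sampled uniformly at random. The uniform-sampling model and all function names are assumptions made for this example.

```python
import math

CHR2_LENGTH_BP = 947_103   # chromosome 2 length reported in the text
N_READS = 23_000           # "over 23,000" reads, used here as a round figure
TARGET_COVERAGE = 10.0     # "about 10x" coverage


def implied_read_length(n_reads, coverage, target_bp):
    """Mean read length (bp) implied by a read count and a fold coverage."""
    return coverage * target_bp / n_reads


def fold_coverage(n_reads, read_len_bp, target_bp):
    """Fold coverage = total sequenced bases / target length."""
    return n_reads * read_len_bp / target_bp


def expected_gaps(n_reads, coverage):
    """Lander-Waterman estimate for uniformly random reads: N * exp(-c).

    Idealized: ignores the minimum overlap needed to join two reads.
    """
    return n_reads * math.exp(-coverage)


read_len = implied_read_length(N_READS, TARGET_COVERAGE, CHR2_LENGTH_BP)
print(f"implied mean read length: {read_len:.0f} bp")

# Expected gaps at several coverage depths, holding read length fixed.
for c in (5.0, 8.0, 10.0):
    n = c * CHR2_LENGTH_BP / read_len
    print(f"{c:>4.0f}x coverage: ~{n:,.0f} reads, ~{expected_gaps(n, c):.1f} expected gaps")
```

Under this idealized model very few gaps would remain at 10x coverage. The model does not account for cloning bias or contaminating DNA, which is why the extra coverage was considered necessary.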

The sequence reads were assembled using a version
of TIGR Assembler (Sutton et al., 1995) that
was extensively modified to assemble the AT-rich
and repeat-rich Plasmodium sequences. TIGR Assembler
identifies and aligns overlapping fragments
in two steps. The initial step in assembly is to locate
all n-mer oligonucleotides shared between fragment
pairs. The software views all fragment pairs with a
high degree of n-mer similarity as potentially overlapping,
and in the second step the Smith-Waterman
method is used to align the fragments. In the bacterial
genome projects the value of n used was typically
10-12 nucleotides. However, using n = 10 with
AT-rich Plasmodium DNA resulted in incorrect
identification of thousands of potential fragment
overlaps, so that the program spent an inordinate
amount of time attempting to align the spurious
matches. Increasing n from 10 to 32 greatly reduced
this problem and significantly lowered the time required
for assembly.
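
The effect of n on spurious overlap candidates can be illustrated with a small simulation. The Python sketch below is not TIGR Assembler; it is a simplified stand-in that flags a pair of reads as a candidate if they share even one n-mer, whereas the text describes a "high degree" of n-mer similarity. The reads are independent random sequences with 80% A+T, so by construction none of them truly overlap and every flagged pair is spurious. The read count, read length, and random seed are arbitrary assumptions.

```python
import random
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers (n-mers) present in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def random_seq(length, at_fraction, rng):
    """Random DNA with the requested A+T fraction (A=T, G=C)."""
    at, gc = at_fraction / 2, (1 - at_fraction) / 2
    return "".join(rng.choices("ATGC", weights=[at, at, gc, gc], k=length))


def candidate_pairs(reads, k):
    """Number of read pairs sharing at least one k-mer."""
    sets = [kmers(r, k) for r in reads]
    return sum(1 for a, b in combinations(sets, 2) if a & b)


rng = random.Random(0)
# 60 independent 400 bp reads: no true overlaps exist among them.
reads = [random_seq(400, 0.80, rng) for _ in range(60)]
total_pairs = len(reads) * (len(reads) - 1) // 2

spurious = {k: candidate_pairs(reads, k) for k in (10, 32)}
for k, n in spurious.items():
    print(f"n = {k:>2}: {n} of {total_pairs} pairs flagged as potential overlaps")
```

With a skewed base composition, short n-mers recur by chance in most read pairs, so nearly every pair would be passed to the expensive alignment step. A longer n-mer makes chance matches vanishingly rare. Real AT-rich DNA also contains low-complexity repeats, which this random model does not capture.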

After the assembly, 610 contigs were obtained and
the largest contig was 50 kb. Neighboring contigs
were identified and ordered by the program
GROUPER, which searches for plasmid templates
with forward and reverse reads in different contigs
(clone links), and for overlapping contigs that failed
to merge under the stringent overlap criteria required
by TIGR Assembler (grasta links). Contigs
within a group are separated by sequence gaps
which can be closed by primer walking on the templates
identified as clone links, or by editing of the
termini of contigs with grasta links. The ends of
groups  represent  physical  gaps  for  which  no  shotgun 
clone  could  be  identified.  Ten  groups  of  114  contigs 
were  localized  on  the  chromosome  by  comparison  to 
STS markers (Lanzer et al., 1993). Closure of physical
and sequence gaps used approaches described
previously (Fleischmann et al., 1995), with a few
modifications to compensate for the AT-richness of
the DNA. To close the 9 physical gaps in the central
region of the chromosome, PCR reactions using genomic
DNA as template were performed with
primers from the ends of adjacent groups. PCR
products were purified and sequenced using dye-terminator
chemistry. This process closed 3 physical
gaps  immediately,  but  PCR  products  from  2  gaps 
contained  very  AT-rich  sequence  which  could  not  be 
sequenced  completely,  and  remained  as  sequence 
gaps.  Those  physical  gaps  for  which  PCR  products 
could  not  be  obtained  in  the  first  step  were  reasoned 
to  be  too  large  for  PCR,  and  to  contain  one  or  more 
of  the  unlocalized  groups.  We  therefore  performed 
combinatorial  PCRs  with  one  primer  from  the  end 
of  a  localized  group  and  the  second  primer  from  the 
ends  of  all  free  groups  larger  than  2.5  kb.  Two  gaps 
were closed by the combinatorial strategy. Finally, 1
physical gap was closed after editing and reassembly,
and another gap was closed by sequencing of a
‘missing mate’ (i.e., resequencing of a clone for
which either the forward or reverse sequencing reaction
had failed during the random phase). Five
methods were used to close sequence gaps. For contigs
which overlapped but had not been merged during
assembly, editing and resequencing were performed
to close the gaps. Many sequence gaps were
caused by artifacts in dye-primer reactions, particularly
in extremely AT-rich areas. Long homopolymer
stretches of up to 50 consecutive A or T residues also
caused the sequence quality to decline downstream
of the homopolymer region. These artifacts
either prevented the merging of overlapping contigs
or produced short sequences that did not extend to
the neighboring contig. Some of these problem areas
could be solved by trimming of the low-quality sequence
that prevented merging of the contigs. For
other gaps, templates from short or low-quality dye-primer
reactions in the vicinity of sequence gaps
were identified and resequenced with dye-terminator
chemistry; the longer reads of high-quality sequence
provided by the dye-terminator reactions were sufficient
to close many gaps. For those gaps that remained,
primer walking on plasmid templates linking
adjacent contigs was used. Finally, there were 5
sequence gaps that could not be closed by the above
methods because the sequence was too AT-rich for
primer synthesis and walking. To close these gaps,
the artificial transposon AT-2 (Devine and Boeke,
1994) was inserted into one of the templates spanning
each sequence gap, multiple subclones of each
template were sequenced using transposon-specific
primers, and the sequences were assembled to close
the gap. The chromosome 2 sequence was edited
manually using TIGR Editor, and where necessary
additional sequencing reactions were performed to
improve coverage and resolve sequence ambiguities.
One major concern, given the well-known propensity
for AT-rich P. falciparum sequences to rearrange
in E. coli, was whether the assembled sequence was
an accurate representation of the genomic sequence.
To independently confirm the colinearity of the assembled
sequence and genomic DNA, NheI and
BamHI optical restriction maps of chromosome 2
DNA were prepared and compared with restriction
maps predicted from the sequence (Jing et al.,
1999). The relative error of predicted and observed
fragment sizes was less than 6%, which indicated that
there were no major rearrangements in the assembled
sequence.
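
The comparison between a restriction map predicted from the sequence and an observed map can be sketched as follows. This Python example is a hedged illustration, not the procedure of Jing et al. It performs an in silico complete digest of a linear sequence with NheI (G^CTAGC) or BamHI (G^GATCC) and computes a per-fragment relative error, defined here as |observed - predicted| / predicted. That definition, the toy sequence, and the "observed" sizes are all assumptions invented for the example, and the fragments are assumed to be already matched in order.

```python
import re

# Recognition sequence and cut offset within the site (both enzymes cut after the first G).
RECOGNITION_SITES = {
    "NheI": ("GCTAGC", 1),
    "BamHI": ("GGATCC", 1),
}


def fragment_sizes(seq, enzyme):
    """Ordered fragment sizes from a complete digest of a linear molecule."""
    site, offset = RECOGNITION_SITES[enzyme]
    # Lookahead so that every occurrence is found, even if sites overlap.
    cuts = [m.start() + offset for m in re.finditer(f"(?={site})", seq.upper())]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def relative_errors(predicted, observed):
    """Per-fragment relative error for fragments already matched in order."""
    return [abs(o - p) / p for p, o in zip(predicted, observed)]


# Toy 47 bp sequence with two NheI sites and no BamHI site.
toy = "A" * 10 + "GCTAGC" + "T" * 20 + "GCTAGC" + "A" * 5
predicted = fragment_sizes(toy, "NheI")

# Hypothetical measured sizes, invented purely for illustration.
observed = [11.5, 25.0, 10.4]
errors = relative_errors(predicted, observed)

print("predicted fragments:", predicted)
print("largest relative error: {:.1%}".format(max(errors)))
```

A large rearrangement in the assembly would change the predicted order or sizes of fragments, so a consistently small error across the map argues against one.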

Annotation of P. falciparum chromosome 2

Annotation of the chromosome 2 sequence followed
the procedures used previously during the annotation
of other genomes, including BLAST searching
of all open reading frames (ORFs) against a protein
sequence database. In addition, to assist in defining
the intron/exon boundaries, a new eukaryotic gene
finding program was developed specifically for use
in this project (Salzberg et al., 1999). This program,
GlimmerM, was trained on a set of 117 P. falciparum
sequences taken from GenBank. Gene models
were constructed based on the GlimmerM predictions,
the similarity of the ORFs to known proteins, and prediction of
putative signal peptides and transmembrane domains.
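
As a rough illustration of the first annotation step, the Python sketch below scans both strands of a sequence for simple open reading frames (an ATG followed in frame by a stop codon). It is a minimal example under stated assumptions and is not GlimmerM or the pipeline used in the project. It ignores introns entirely, which is precisely the limitation that a dedicated eukaryotic gene finder was developed to address. The function names and the minimum-length parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """(start, end) of ATG...stop ORFs in the three frames of one strand.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))
                start = None
    return found


def find_orfs(seq, min_codons=100):
    """ORFs on both strands, reported in forward-strand coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s, e, "+") for s, e in orfs_in_strand(seq, min_codons)]
    rc = reverse_complement(seq)
    orfs += [(n - e, n - s, "-") for s, e in orfs_in_strand(rc, min_codons)]
    return sorted(orfs)


# Toy sequence: one short ORF of six codons plus a stop on the forward strand.
toy = "CC" + "ATG" + "AAA" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))
```

Each ORF found this way would then be translated and searched against a protein database. In an 80% A+T genome, stop codons (all A/T-rich) occur frequently by chance, so long ORFs are comparatively informative.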


Chromosome 2 of P. falciparum is 947,103 bp in
length and 80.2% A+T (Gardner et al., 1998). It
possesses typical eukaryotic telomeres and subtelomeric
regions containing several kb of rep20 tandem
repeats, variant antigen genes (var), and a potential
new family of variant surface antigens related to the
RIF-1 elements (repetitive interspersed family) (Weber,
1988). The large central region encodes many
single copy genes and several genes that are tandemly
repeated (Fig. 1). Two hundred and nine protein-encoding
genes and a gene encoding tRNA(Glu) were
predicted on chromosome 2, giving a gene density of
one gene per 4.5 kb, which is significantly lower than
in yeast (one gene per 2 kb) but higher than in C. elegans
(one gene per 7 kb). It was estimated that 43%
of the 209 protein-encoding genes contained at least
one intron, with most such genes consisting of 2 or 3
exons. Two genes, however, contained 8 exons. Extrapolation
of the chromosome 2 data to the entire
28 Mb P. falciparum genome suggests that it contains
about 6,200 genes, about 2,600 of which may contain introns.
Thus, in terms of intron content and gene density the
P. falciparum genome appears to be intermediate between
the compact yeast genome and the intron-rich
genomes of multicellular eukaryotes.
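
The gene density and the genome-wide gene estimate follow directly from the figures given above. The short Python sketch below reproduces that arithmetic. It assumes, as the extrapolation in the text implicitly does, that gene density on chromosome 2 is representative of the whole genome; the function names are illustrative.

```python
CHR2_LENGTH_BP = 947_103       # chromosome 2 length
CHR2_PROTEIN_GENES = 209       # predicted protein-encoding genes
CHR2_TRNA_GENES = 1            # the single tRNA gene
GENOME_LENGTH_BP = 28_000_000  # approximate genome size


def kb_per_gene(length_bp, n_genes):
    """Average spacing of genes in kilobases."""
    return length_bp / n_genes / 1000


def extrapolate(count_on_chr2):
    """Scale a chromosome 2 count to the genome by sequence length."""
    return count_on_chr2 * GENOME_LENGTH_BP / CHR2_LENGTH_BP


total_genes = CHR2_PROTEIN_GENES + CHR2_TRNA_GENES
density = kb_per_gene(CHR2_LENGTH_BP, total_genes)
genome_genes = extrapolate(total_genes)

print(f"gene density: one gene per {density:.1f} kb")
print(f"estimated genes in genome: about {genome_genes:,.0f}")
```

The same length-based scaling can be applied to any feature counted on chromosome 2, with the caveat that features concentrated in particular regions, such as subtelomeric gene families, may not scale uniformly.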

Of the 209 protein-encoding genes, only 87 (42%)
appeared to have homologs outside Plasmodium,
suggesting that almost 60% of the genes encoded on
this chromosome are so far ‘unique’ to Plasmodium.
The proportion of unique genes is almost 2-fold
greater than has been observed in other organisms,
and confirms that there is much biology to be uncovered
in future studies of this parasite. As sequencing
of other related parasites proceeds, some
of these proteins will undoubtedly be found to have
homologs in apicomplexans such as Toxoplasma
(Ajioka et al., 1998) and Eimeria, and hence may be
found to be characteristic of apicomplexan parasites.
Most of the remaining unidentified proteins on chromosome
2 were predicted to consist primarily of
non-globular domains, i.e. domains that are composed
of low complexity sequences that do not form
compact folded structures (Wootton and Federhen,
1996). The abundance of non-globular domains or
proteins in Plasmodium was very unusual, and was
about half that observed in S. cerevisiae, C. elegans,
and humans. In addition, 13 proteins contained
large regions (>30 amino acids) with predicted non-globular
structure inserted directly into globular domains,
a phenomenon so far unique to Plasmodium.
These non-globular insertions did not exhibit the
AT-bias typical of introns, were not flanked by consensus
splice sites, and based on RT-PCR analysis of
several genes encoding non-globular domains, were
likely to be expressed in the proteins. The abundance
of the non-globular domains in Plasmodium
proteins suggests that they provide as yet unknown
selective advantages to the parasite. Study of these
proteins containing non-globular inserts may also
provide new insight into the general principles of
protein folding.
Fig. 1. Map of P. falciparum chromosome 2 (clone 3D7). Exons are shown as boxes or arrows, with introns represented by
thin lines connecting the exons. Other features such as telomeric and subtelomeric repeats are indicated as shown in the
legend. Chromosome 2 genes with similarity to known genes in the sequence databases and for which putative functional
assignments could be made are stippled; hypothetical genes with no detectable similarity to known genes are indicated by
vertical stripes; genes with similarity to previously sequenced genes of unknown function are indicated as open arrows. The
rifin and var genes are labeled with ‘R’ and ‘V’, respectively. Genes were given systematic names using a scheme similar
to that devised for the S. cerevisiae genome (Mewes et al., 1997). For a complete description of the genes encoded on
chromosome 2, including details of functional assignments, see Gardner et al. (1998).


Most  of  the  87  evolutionarily-conserved  proteins 
encoded on chromosome 2 show the greatest similarity
to eukaryotic homologs or belong to specifically
eukaryotic protein families. Many of these
genes code for proteins that participate in replication,
repair, transcription, or translation, and include
the origin recognition complex subunit 5, two proteins
involved in excision repair, several
proteins involved in chromatin dynamics, RNA-binding
proteins, and a putative transcription factor.
Other evolutionarily conserved proteins are involved
in secretion, such as the SEC61 gamma subunit, the
coated pit coatomer subunit, and syntaxin, suggesting
early emergence of the eukaryotic secretory system.
Five proteins contained DnaJ domains; in other
organisms  DnaJ  proteins  have  been  shown  to  act  as 
cofactors  for  the  HSP70-type  molecular  chaperones 
and  to  participate  in  a  variety  of  processes  such  as 
protein  folding  and  trafficking,  complex  assembly, 
organelle biogenesis, and initiation of translation
(Cyr et al., 1994). Chromosome 2 encodes 90 predicted
membrane proteins, some of which appear to
be transporters of amino acids or sugars. Five putative
protein kinases were also identified, suggesting
that the P. falciparum genome may encode about
150 protein kinases. This prominence of regulators
is in striking contrast to the situation in bacterial
pathogens, which appear to have shed most of their
regulatory systems, and is probably a reflection of
the complex life cycle. For example, phosphorylation
and dephosphorylation reactions are known to be involved
in the development and sexual differentiation
of malaria parasites (Bracchi et al., 1996). A cluster
of 8 tandemly arranged genes encoding putative proteases
was also found; 3 of these genes were known
previously and were called SERAs (SErine Repeat
Antigens). The expansion of this protease gene family
suggests an important function, possibly in merozoite
release from schizonts or processing of merozoite
surface proteins.
While most of the evolutionarily conserved proteins
were more similar to eukaryotic homologs, 16
proteins were significantly more similar to bacterial
homologs and 4 other proteins were the first eukaryotic
representatives of conserved bacterial protein
families. These proteins may have been transferred
to the nuclear genome from an organellar
genome after the divergence of the apicomplexa
from the other eukaryotic lineages. Several of these
proteins contained N-terminal sequences that resembled
organellar import peptides, which suggested
that these proteins may be imported into and function
within either the apicoplast or the mitochondrion.
Of particular interest were 3 genes encoding
proteins involved in fatty acid metabolism. One of
these proteins, 3-ketoacyl-ACP synthase III (FabH),
catalyzes the condensation of acetyl-CoA and malonyl-ACP
in Type II (dissociated) fatty acid synthase
systems. Type II synthase systems are restricted to
bacteria and the plastids of plants, and the discovery
of a Type II fatty acid synthase system in Plasmodium
reinforced previous hypotheses that the apicoplast
contains plant-like metabolic pathways distinct
from those of the host (Wilson et al., 1991;
Slabas and Fawcett, 1992). Some of the biochemical
processes that occur within this organelle may therefore
be good drug targets (Soldati, 1999). Recent
work has confirmed that at least some of the predicted
import peptides can direct translocation of reporter
proteins into the apicoplast in Toxoplasma,
and in addition, that thiolactomycin, a specific inhibitor
of bacterial FabH, can inhibit the growth of
P. falciparum in vitro (Waller et al., 1998).

As mentioned previously, more than half of all proteins
encoded on chromosome 2 did not have detectable
homologs in other species. Many of the
Plasmodium-specific genes were located in the
subtelomeric regions of the chromosome. Two members
of the var gene family were identified on chromosome
2, one in each subtelomeric region. The var
genes encode large proteins, collectively known as
PfEMP1s, that are located on the surface of infected
red cells, exhibit extensive sequence diversity, and
are involved in antigenic variation, cytoadherence,
and rosetting (Baruch et al., 1995; Smith et al.,
1995; Su et al., 1995; Rowe et al., 1997). Most var
genes are located in subtelomeric regions, and var
gene diversity is thought to be generated by recombination
between alleles, a process which might be
facilitated by the subtelomeric repeats (Rubio et al.,
1996). Six small ORFs that had similarity to var sequences
were also found in the subtelomeric regions.
Five of these ORFs resembled the var exon II
cDNAs or the Pf60.1 sequences that were reported
previously (Su et al., 1995; Bonnefoy et al., 1997).
However, the largest gene family identified on chromosome
2 encoded proteins of 27-35 kD that were
named rifins, after the RIF-1 repetitive elements
(Weber, 1988). These proteins contained an N-terminal
signal sequence, a central region of variable
length and an amino acid sequence containing conserved
cysteine residues, a transmembrane domain,
and a C-terminus rich in basic amino acids, and were
predicted to be expressed on the surface of infected
red cells. All eighteen of the rifin genes were in the
subtelomeric regions, centromere-proximal to the
var genes. Clusters of rifin genes have been detected
on other chromosomes (Cheng et al., 1998), and if
the number of rifins found on chromosome 2 is representative
of the other chromosomes, the P. falciparum
genome may contain more than 500 rifin
genes. While the function of the rifins is not known,
the extensive sequence diversity of the rifins suggests
that, like the var gene products, they may be clonally
variant. Further studies are under way in a number
of laboratories to confirm the subcellular localization
of the rifins and to determine their function.

Future  prospects 

The completion of the first P. falciparum chromosome
and the rapid progress being made by all three
genome centers on the remaining chromosomes
(Table 1) suggest that the entire P. falciparum
genome will be completed within 2-3 years. In fact,
it is quite likely that most of the parasite’s genes will
have been identified within 18-24 months, with the
additional time being spent on the closing of gaps in
the sequence. Ideally, the completion of the P. falciparum
genome sequence will be followed by the sequencing
of a second Plasmodium species so as to
provide valuable comparative information. The human
parasite P. vivax and several rodent malaria
parasites used as model systems for vaccine and
drug development are currently viewed as candidates
for sequencing. In addition, information derived
from expressed sequence tag (EST) or genome
sequencing projects for other apicomplexa such as
Toxoplasma (Ajioka et al., 1998) will help to identify
parasite-specific metabolic pathways that will be
useful for development of new drugs against these
organisms. Recent technological advances such as
the stable transfection of several Plasmodium
species (van Dijk et al., 1995; Wu et al., 1995;
Crabb and Cowman, 1996; van der Wel et al., 1997),
the ability to knock out specific genes (Menard
et al., 1996; Crabb et al., 1997), and the development
of microarray technologies for global measurements
of gene expression (Schena et al., 1995) will
help in the interpretation of the genome sequence.
This is important in view of the fact that less than
one-half of all the genes identified on the first P. falciparum
chromosome to be sequenced could be assigned
functional roles. Clearly, there is much exciting
research to be done, and researchers studying
Plasmodium and related parasites can look forward
to Bloom’s post-genomic era of microbe biology.
The genome of the human malaria parasite Plasmodium
falciparum is being sequenced by an international consortium.
Two of the parasite’s 14 chromosomes have been completed
and  several  other  chromosomes  are  nearly  finished.  Even  at 
this  early  stage  of  the  project,  analysis  of  the  genome 
sequence  has  provided  promising  new  leads  for  drug  and 
vaccine  development. 
Abbreviations

CTL  cytotoxic  T  lymphocyte 

EST  expressed  sequence  tag 

GST  gene  sequence  tag 

STS  sequence-tagged  site 

YAC  yeast  artificial  chromosome 

Introduction 

Over one-third of the world’s population is at risk of contracting
malaria, a mosquito-borne disease caused by
apicomplexan parasites of the genus Plasmodium. There
are ~300-500 million new cases and ~1.5-2.7 million
deaths from malaria annually. Most deaths due to malaria
occur among children in sub-Saharan Africa [1]. At present
there is no effective, practical vaccine that can be used to
prevent malaria, and although there are effective antimalarial
drugs, resistance to one or more of these drugs has
developed in many parts of the world. Development of
new drugs and vaccines has been only moderately successful,
limited by the financial resources that are available
and the difficulty of working with a complex intracellular
parasite. (A comprehensive collection of review articles on
all aspects of Plasmodium biology can be found in [2*].)

Completion  of  the  first  microbial  genome  sequences 
demonstrated  the  benefits  that  accrue  from  genome 
sequencing  [3].  For  a  pathogenic  organism,  the  genome 
sequence  provides  the  sequence  of  every  potential  drug  or 
vaccine target; for difficult-to-study organisms like Plasmodium,
sequencing of the genome may be the only way to identify these
targets. The Plasmodium falciparum genome is approximately
28 megabase pairs (Mb) in length and contains 14 chromosomes
ranging in size from ~0.6 to 3.4 Mb. Chromosome sizes can vary
markedly between wild isolates as a result of recombination
events involving the repeat-rich subtelomeric regions of the
chromosomes. The genome is extremely A+T rich (~80%), which
might account for the instability of large fragments of
P. falciparum DNA in E. coli. The DNA is more stable in yeast;
large insert yeast artificial chromosome (YAC) libraries have
been constructed and used to generate STS (sequence-tagged
site) maps of most
of  the  chromosomes  [4].  In  addition,  a  linkage  map  of  the 
genome  consisting  of  more  than  900  microsatellite  markers 
and  having  a  resolution  of  30  kb  has  been  produced  [5**]. 
Expressed  sequence  tags  (ESTs)  from  blood  stage  parasites 
and  gene  sequence  tags  (GSTs)  have  also  been  prepared 
[6,7].  Techniques  for  manipulation  of  the  genome  have 
been  developed  including  stable  transfection  and  gene 
knockouts  [8**].  This  review  summarizes  recent  progress  in 
the  sequencing  of  the  P.  falciparum  genome,  and  outlines 
how  the  genome  sequence  information  produced  in  this 
effort  is  contributing  to  the  development  of  new  drugs  and 
vaccines  against  malaria. 
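
The A+T richness mentioned above is straightforward to quantify for any sequence. The following is a minimal sketch, not part of the sequencing project's tooling: the function names and the toy sequence are made up for illustration, and ambiguous bases such as N are simply ignored.

```python
def at_fraction(seq: str) -> float:
    """Return the fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ACGT"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return (counts["A"] + counts["T"]) / total


def sliding_at(seq: str, window: int, step: int):
    """Yield (start, A+T fraction) for each full window along the sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, at_fraction(seq[start:start + window])


# Hypothetical 20-base fragment, invented for this example.
toy = "ATATATTTAAGCATATTAAT"
print(f"A+T fraction: {at_fraction(toy):.2f}")  # prints "A+T fraction: 0.90"
```

A windowed scan such as `sliding_at` is one simple way to locate unusually A+T-rich stretches along a chromosome; the window and step sizes would have to be chosen for the data at hand.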

The Plasmodium falciparum genome sequencing project

P. falciparum is the most lethal of the four Plasmodium
species that cause malaria in humans. Fortunately, all
stages of the P. falciparum life cycle can be maintained in
the laboratory, blood stages can be cultured routinely, and
cloned parasites are available. In late 1996, a consortium of
funding agencies, genome centers, and malaria investigators
was formed to sequence the Plasmodium falciparum
genome [9,10]. A strategy was adopted whereby individual
chromosomes assigned to each genome center were
resolved by pulsed field gel electrophoresis and subjected
to shotgun sequencing. STS markers [4], the microsatellite
linkage map [5**,11], and optical restriction maps [12**,13]
of the chromosomes were used for ordering of the contiguous
sequences during the gap closure phase and for
verification of the final sequence assembly. Chromosomes
2 and 3, which comprise about 7% of the genome, have
been completed [14**,15**]; preliminary data at various
stages of completion are available for the remaining
chromosomes (Table 1). One difficulty faced by the
sequencing groups was the identification of genes in the
A+T-rich sequence. Gene finding algorithms developed
for higher eukaryotes, which have a much lower gene density
than Plasmodium, were not optimal for the prediction
of coding regions in Plasmodium DNA, and prokaryotic
gene finders were unable to predict introns. GlimmerM
gene finding software was developed during the chromosome
2 project; it uses interpolated Markov models
constructed from a training set of well-characterized genes
for prediction of coding regions and a separate module for
prediction of splice sites [16].
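
To make the idea of an interpolated Markov model more concrete, the sketch below scores a DNA sequence by blending predictions from contexts of increasing length, trusting a longer context only in proportion to how often it was seen in training. This is a deliberately simplified illustration and not GlimmerM's actual method: the class name `SimpleIMM`, the count-based weight `n / (n + pseudo)`, and the toy training sequences are all assumptions made for this example.

```python
import math
from collections import defaultdict

BASES = "ACGT"


class SimpleIMM:
    """Toy interpolated Markov model over DNA (illustrative weighting only)."""

    def __init__(self, max_order=3, pseudo=4.0):
        self.max_order = max_order
        self.pseudo = pseudo  # larger value -> more weight on shorter contexts
        # counts[k][context][base]: times `base` followed the k-long `context`
        self.counts = [defaultdict(lambda: defaultdict(int))
                       for _ in range(max_order + 1)]

    def train(self, sequences):
        for seq in sequences:
            seq = seq.upper()
            for i, base in enumerate(seq):
                if base not in BASES:
                    continue
                for k in range(min(i, self.max_order) + 1):
                    self.counts[k][seq[i - k:i]][base] += 1

    def prob(self, context, base):
        """Interpolated P(base | context), starting from a uniform prior."""
        context, base = context.upper(), base.upper()
        if len(context) > self.max_order:
            context = context[len(context) - self.max_order:]
        p = 1.0 / len(BASES)
        for k in range(len(context) + 1):
            seen = self.counts[k].get(context[len(context) - k:])
            if not seen:
                break  # context never observed: keep the shorter-order estimate
            n = sum(seen.values())
            lam = n / (n + self.pseudo)
            p = lam * seen.get(base, 0) / n + (1.0 - lam) * p
        return p

    def log_score(self, seq):
        """Sum of log probabilities of each base given its preceding context."""
        seq = seq.upper()
        return sum(math.log(self.prob(seq[max(0, i - self.max_order):i], b))
                   for i, b in enumerate(seq) if b in BASES)


# Hypothetical training data: one "coding-like" and one "background" sequence.
coding = SimpleIMM(max_order=2)
coding.train(["GCTGCTGCTGCTGCTGCTGCTGCT"])
background = SimpleIMM(max_order=2)
background.train(["ATATATATATATATATATATATAT"])

query = "GCTGCTGCT"
log_odds = coding.log_score(query) - background.log_score(query)
print(f"log-odds (coding vs. background): {log_odds:.2f}")
```

A positive log-odds value means the query looks more like the coding training set than the background set. A real gene finder would additionally track reading frame, handle both strands, and model splice sites, none of which is attempted here.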

Table 1

Web sites related to the malaria genome sequencing project.

| Web site | Content | URL |
|---|---|---|
| P. falciparum chromosomes 2, 10, 11, 14, TIGR/Naval Medical Research Center | Chromosome 2 annotation [14**]; preliminary data | http://www.tigr.org/tdb/mdb/pfdb/pfdb.html |
| P. falciparum chromosomes 1, 3, 4, 5-9, 13, The Sanger Centre | Chromosome 3 annotation [15**]; preliminary data | http://www.sanger.ac.uk/Projects/P_falciparum/ |
| P. falciparum chromosome 12, Stanford University | Preliminary data for chromosome 12 | http://sequence-www.stanford.edu/group/malaria/index.html |
| P. falciparum Gene Sequence Tag Project, University of Florida | A collection of ESTs and GSTs for P. falciparum [6,7] | http://parasite.arf.ufl.edu/malaria.html |
| Malaria Database, Monash University, Walter and Eliza Hall Institute | A collection of genetic information on malaria parasites | http://www.wehi.edu.au/MalDB-www/who.html |
| Malaria Genetics and Genomics, National Center for Biotechnology Information (NCBI) | BLAST searches on Apicomplexan sequence data, including P. falciparum; P. falciparum linkage maps, etc. | http://www.ncbi.nlm.nih.gov/Malaria/ |
| Parasite Genomes Blast Server, European Bioinformatics Institute | BLAST searches on sequence data from many parasites, including Plasmodium | http://www.embl-ebi.ac.uk/parasites/parasite_blast_server.html |
| Malaria Foundation | General information on malaria and many links to malaria-related sites | http://www.malaria.org/index.htm |
| TIGR Microbial Database | A comprehensive listing of microbial genome projects | http://www.tigr.org/tdb/mdb/mdb.html |

BLAST, basic local alignment search tool; dbEST, database of expressed sequence tags; GSTs, genome sequence tags.

The chromosome sequences revealed that 20-30 kb of
each chromosome end was composed of telomeric, rep20,
and other repeats [14**,15**]. Centromeric to these repeats,
members of multigene families involved in antigenic variation
and/or pathogenesis were found [17*], including var
genes that encode the PfEMP1 proteins [18-21], open
reading frames with similarity to the 3' exon of var genes
that may represent a distinct gene family [22], and members
of the rif and STEVOR gene families (see below). Gene
density was about 1 gene per 4.7 kb and almost one-half of
genes were predicted to contain introns. Depending upon
the methods used for annotation, up to two-thirds of the
genes identified had no detectable orthologs in other
organisms, which suggests that our current understanding of
malaria parasite biology is woefully incomplete.
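
As a rough back-of-the-envelope check, the figures quoted in this review (a genome of about 28 Mb, about 1 gene per 4.7 kb, and chromosomes 2 and 3 making up about 7% of the genome) can be combined to estimate gene numbers. The sketch below assumes, purely for illustration, that the gene density seen on the first two chromosomes holds uniformly across the genome; that assumption is not made in the text, and repeat-rich subtelomeric regions may well deviate from it.

```python
GENOME_SIZE_BP = 28_000_000   # approximate genome size quoted in the text
KB_PER_GENE = 4.7             # approximate gene density quoted in the text
CHR2_CHR3_FRACTION = 0.07     # share of the genome in chromosomes 2 and 3


def estimate_gene_count(length_bp: float, kb_per_gene: float) -> float:
    """Estimate gene number for a region, assuming uniform gene density."""
    return length_bp / (kb_per_gene * 1000)


whole_genome = estimate_gene_count(GENOME_SIZE_BP, KB_PER_GENE)
chr2_chr3 = estimate_gene_count(GENOME_SIZE_BP * CHR2_CHR3_FRACTION, KB_PER_GENE)

print(f"Estimated genes, whole genome: {round(whole_genome)}")
print(f"Estimated genes, chromosomes 2 and 3: {round(chr2_chr3)}")
```

Under these assumptions the estimate comes to roughly 6,000 genes for the whole genome and a little over 400 for the two completed chromosomes; these are extrapolations, not reported results.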

The  investment  in  sequencing  of  the  genome  has  already 
paid handsome dividends. A large gene family (rif) was
identified on chromosome 2 [14**] (the STEVOR family was
proposed to be a family related to, but distinct from, the rif
family [23**]). The rif genes encoded polypeptides of
27-35 kDa (rifins) that were predicted to be located on the
red cell surface and which contained a region variable in
length and amino acid sequence. The sequence polymorphism
of the rifins, their presumed cell surface localization,
and the distribution of the rif genes in subtelomeric regions
near the var genes suggested that rifins might be a new
class of variant surface antigen. Laboratory studies have
now proven that the rifins are expressed on the surface of
the infected red cell and that they are clonally variant, but
the function of these proteins has not been determined
[24**]. Like the PfEMP1 proteins encoded by the var gene
family, which mediate cytoadherence and rosetting, the
rifins might have a role in host-parasite interaction.

Other  major  findings  included  the  discovery  of  genes 
encoding  enzymes  of  the  type  II  fatty  acid  biosynthetic 
pathway that were previously found only in plants and
bacteria [14**,25**], a cluster of four genes of unknown function
that was repeated on one end of both chromosomes 2 and 3,
and the identification of putative centromere sequences
[15**]. The predicted centromeres (~2-3 kb in length) were
located in the most A+T-rich region of each chromosome
(>97% A+T), which in both cases was underrepresented in
the plasmid shotgun libraries used for sequencing and was
the most difficult region to sequence. Proof that these
regions actually are centromeres awaits improvements in
transfection technology; however, if these are centromeres
they could be useful for the stable maintenance of
minichromosomes in transfected parasites.

Identification of new chemotherapeutic targets using the genome sequence

Investigation of a 35 kb extra-chromosomal DNA with features
characteristic of plastid DNA by Wilson and colleagues [26]
led to the identification of an organelle in Plasmodium,
Toxoplasma, and related parasites called the
apicoplast [27-30]. Early studies revealed that organellar
protein  synthesis  and  DNA  replication  were  targets  of 
antibiotics  with  antimalarial  activity  (for  reviews  see 
[31,32]).  Analysis  of  the  complete  sequence  of  the  35  kb 
DNA  provided  few  clues  to  the  function  of  the  organelle 
but,  like  plastids  of  higher  plants,  the  organelle  was 
hypothesized  to  contain  biochemical  pathways  essential 
for  cell  survival.  If  such  pathways  were  parasite  specific 
they  would  make  attractive  drug  targets. 

Because  most  proteins  in  the  plastids  of  other  organisms 
are  encoded  by  nuclear  DNA  and  imported  into  plastids,  it 
was  clear  that  the  genes  encoding  the  enzymes  of  these 
pathways were to be found in the nuclear genome. Analysis
of the genome sequence in conjunction with
transfection studies in Toxoplasma has led to the identification
of nuclear-encoded proteins that are imported into
the apicoplast and the amino-terminal sequences that
direct the transport of these proteins into the organelle
[25**]. The fabH gene encoding 3-ketoacyl acyl carrier protein
synthase III, an enzyme involved in type II fatty
acid biosynthesis, was identified on chromosome 2 in
P. falciparum and shown to contain a putative
apicoplast-targeting peptide. The antibiotic thiolactomycin, which
inhibits the orthologous bacterial enzyme, was shown to
possess growth-inhibitory activity against P. falciparum
in vitro [25**].

Most recently, genes encoding enzymes of the non-mevalonate
pathway of isoprenoid biosynthesis were identified
in preliminary data from the chromosome 14 sequencing
project and the enzymes were predicted to be localized in
the apicoplast [33**,34]. Inhibitors of one enzyme in the
pathway (1-deoxy-D-xylulose 5-phosphate reductoisomerase)
were found to inhibit the activity of the
recombinant enzyme expressed in bacteria and to possess
antiparasite activity in vitro and in vivo. These examples
validate  the  early  interest  in  plastid-localized  pathways  as 
drug  targets,  and  demonstrate  the  rapidity  with  which 
potential  drug  targets  can  be  identified  with  genome 
sequence  information. 

Investigators  searching  for  new  drug  targets  have  also  found 
plant-like  biochemical  pathways  in  apicomplexan  parasites 
that  may  not  be  located  in  the  apicoplast  (e.g.  the  shikimate 
pathway)  using  more  conventional  approaches  [35*, 36]. 
Other  potential  targets  not  related  to  the  apicoplast  have 
also  been  identified  via  gene  sequence  information  [37].  As 
the  sequencing  of  the  genome  proceeds  it  will  be  possible 
to  construct  an  increasingly  comprehensive  view  of  parasite 
metabolism  (the  ‘metabolome’),  which  should  permit  the 
identification  of  many  more  novel  drug  targets.  Successful 
exploitation  of  these  novel  targets  may  reduce  reliance  on 
current  antimalarials  to  which  resistance  has  developed  and 
permit  the  development  of  multi-drug  therapies  that  may 
slow  the  development  of  resistance  in  the  future. 


Genome  sequence  and  vaccine  development 

The P. falciparum genome sequence will also provide the
amino acid sequences of all potential vaccine antigens.
Characterization of the hundreds or thousands of antigens
to be identified from the genome sequence and their
formulation into effective vaccines will be a formidable
task, one made more difficult by the requirement that
each vaccine must elicit the appropriate immune response
for targeting of the different stages of the parasite life cycle
[38,39]. One proposed approach is to clone individual
P. falciparum genes or long open reading frames into DNA
vaccines, generate antisera to the encoded proteins in
mice, and use immunofluorescence assays to determine
the expression patterns and subcellular localization of the
candidate antigens in the parasite [40**]. Antigens
expressed only within infected hepatocytes, which are
targeted primarily by CD8+ T cell responses, could be
screened via computer algorithms to predict cytotoxic
T lymphocyte (CTL) epitopes. The CTL epitopes could
be combined into a series of multi-epitope DNA vaccine
constructs and multicomponent DNA vaccines encoding
many full-length liver stage antigens could also be
prepared. Blood stage antigens accessible to antibodies could
also be formulated into DNA vaccines. Clinical trials to
establish immunogenicity and protective efficacy of the
vaccines would follow. Pilot projects using genes from the
two completed chromosomes could be used to validate this
approach prior to its application on a large scale. Other
approaches to the use of genome data for vaccine
development are also possible, including scaling-up of the current
antigen-by-antigen strategy using rodent malaria orthologs
to P. falciparum antigens, or targeted expression library
immunization techniques [41].
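
The computer screening for candidate CTL epitopes mentioned above can be illustrated, in its simplest form, as a scan for 9-residue peptides with preferred anchor residues. The sketch below is an assumption-laden toy: the anchor motif (L or M at position 2, V or L at position 9) is only loosely modeled on commonly described HLA-A2 anchor preferences, the protein sequence is invented, and real epitope prediction relies on far more detailed binding models than a two-position motif.

```python
def scan_nonamers(protein: str, p2: str = "LM", p9: str = "VL"):
    """Return (1-based start, peptide) for 9-mers matching a two-anchor motif.

    p2 and p9 list the residues accepted at peptide positions 2 and 9.
    The default motif is an illustrative assumption, not a validated predictor.
    """
    protein = protein.upper()
    hits = []
    for i in range(len(protein) - 8):
        peptide = protein[i:i + 9]
        if peptide[1] in p2 and peptide[8] in p9:
            hits.append((i + 1, peptide))
    return hits


# Hypothetical antigen fragment, invented for this example.
toy_antigen = "MKLAAGIVTLNNSLLPEAIDV"
for start, peptide in scan_nonamers(toy_antigen):
    print(start, peptide)
# prints:
# 2 KLAAGIVTL
# 13 SLLPEAIDV
```

Candidates found this way would still need experimental confirmation of binding and immunogenicity; the scan only narrows the list of peptides worth testing.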

Comparative  genomics 

Four species of Plasmodium are currently known to infect
humans. P. falciparum is by far the most lethal of the four
species, but P. vivax, P. malariae, and P. ovale cause significant
morbidity. P. vivax is the most prevalent of these and
is of increasing concern because of the development of
chloroquine resistance. Apart from the sequencing of
genes encoding potential vaccine antigens and drug targets,
comparatively little molecular biology has been done
with these parasites, primarily because they are extremely
difficult or impossible to culture continuously in vitro [42]
and must be maintained in primates. Carlton et al. [43*]
have produced karyotype maps of the three other human
Plasmodia. Like P. falciparum, these species appear to have
14 chromosomes but their genomes may be 10-15 Mb larger
than the P. falciparum genome, possibly as a result of
differences in the amount of subtelomeric non-coding
DNA. Four synteny groups common to all four species
were identified, which suggests that gene order has been
preserved across species in many cases. Because P. vivax is
the second most important human malaria parasite and exhibits
numerous biological characteristics that differ from
P. falciparum, it is quite likely that the P. vivax genome will be
sequenced; an EST gene discovery project has already
been initiated. Comparison of the P. falciparum and P. vivax
genomes should enable the identification of genes responsible
for the biological and pathogenicity differences
between the two species. In addition, sequence data from
murine Plasmodia and related parasites such as Toxoplasma
(Table 1) and Theileria [44*] will help to define
apicomplexan-specific genes.


Conclusions 

Tremendous progress towards an understanding of Plasmodium
biology has been made over the past decade. We can
expect the rate of progress to increase in the next decade
once the complete genome sequence of P. falciparum is
determined. This information, coupled with improvements
in areas such as informatics, transfection technology, and the
development of oligonucleotide [45] and glass slide microarrays
[46] for examination of gene expression on a
genome-wide scale, will allow investigators to delve into
areas of Plasmodium biology that are so far unexplored.
These discoveries will provide a much more complete picture
of malaria parasite biology and facilitate the
development  of  new  drugs  and  vaccines  to  combat  malaria. 

Note  added  in  proof 

An important new work on P. falciparum restriction mapping
has just been published [48**].